Data Jobs Salaries in November 2023 in Mexico (Part 2)¶


Author: Daniel Eduardo López

LinkedIn | GitHub

14 January 2024

pexels-photo-577210_resized.jpg

Image Credit: Lukas from Pexels.


Table of Contents¶


  1. Introduction
    1.1 Background
    1.2 General Objective
    1.3 Research Question
    1.4 Hypothesis
    1.5 Abridged Methodology
    1.6 Notebook's Goal
  2. Data Collection
  3. Data Exploration
    3.1 Data Description
    3.2 Data Quality
    3.3 Exploratory Data Analysis (EDA)
  4. Data Preparation
    4.1 Drop of Duplicates
    4.2 Cleaning of the Attribute Job
    4.3 Cleaning of the Attribute Company
    4.4 Cleaning of the Attribute Location
    4.5 Cleaning of the Attribute Salary
    4.6 Processed Dataset Exporting
    4.7 Salary Dataset
  5. Data Analysis & Visualization
    5.1 What is the most demanded data job?
    5.2 Where do the most data jobs locate?
    5.3 How is the data jobs demand per location?
    5.4 What are the companies demanding more data jobs?
    5.5 How is the data jobs demand per company?
    5.6 Where do the companies demanding data jobs locate?
    5.7 How many salary observations are for each data job category?
    5.8 Which data job category has the highest salary?
    5.9 Where the highest salaries can be found?
    5.10 What companies offer the highest salaries?
    5.11 What companies offer the highest salaries per data job category?
    5.12 What vacancies offer the highest salaries?
  6. Statistical Analysis
    6.1 How is the salary distribution of the data jobs?
    6.2 Are the salary observations following a normal distribution?
    6.3 How are the salary distributions by data job category?
    6.4 Are the salary differences among Data Jobs statistically significant?
    6.5 Which salary differences among Data Jobs are statistically significant?
    6.6 Are the Data Architect salaries significantly lower than those for ML Engineers?
    6.7 Are the Data Engineer salaries significantly lower than those for ML Engineers?
    6.8 Are the Data Scientist salaries significantly lower than those for ML Engineers?
    6.9 Are the Business Analyst salaries significantly lower than those for ML Engineers?
    6.10 Are the BI Analyst salaries significantly lower than those for ML Engineers?
    6.11 Are the Data Analyst salaries significantly lower than those for ML Engineers?
    6.12 Are the Data Architect salaries significantly higher than those for Data Engineers?
    6.13 Are the Data Architect salaries significantly higher than those for Data Scientists?
    6.14 Are the Data Architect salaries significantly higher than those for Business Analysts?
    6.15 Are the Data Architect salaries significantly higher than those for BI Analysts?
    6.16 Are the Data Architect salaries significantly higher than those for Data Analysts?
    6.17 Are the Data Engineer salaries significantly higher than those for Data Scientists?
    6.18 Are the Data Engineer salaries significantly higher than those for Business Analysts?
    6.19 Are the Data Engineer salaries significantly higher than those for BI Analysts?
    6.20 Are the Data Engineer salaries significantly higher than those for Data Analysts?
    6.21 Are the Data Scientist salaries significantly higher than those for Business Analysts?
    6.22 Are the Data Scientist salaries significantly higher than those for BI Analysts?
    6.23 Are the Data Scientist salaries significantly higher than those for Data Analysts?
    6.24 Are the Business Analyst salaries significantly higher than those for BI Analysts?
    6.25 Are the Business Analyst salaries significantly higher than those for Data Analysts?
    6.26 Are the BI Analyst salaries significantly higher than those for Data Analysts?
  7. Conclusions
  8. References

1. Introduction ¶


1.1 Background ¶

With the emergence of the big data, new jobs have appeared demanding new sets of skills and expertise for extracting value from data (Axis Talent, 2020; ai-jobs.net, 2023):

  • Business Intelligence Analysts (BI)
  • Business Analysts (BA)
  • Data Analysts (DA)
  • Data Architects (DR)
  • Data Engineers (DE)
  • Data Scientists (DS)
  • Machine Learning Engineers (ML)

Which one is the most valued in the Mexican labor market currently?

1.2 General Objective ¶

To identify which data job category has the highest salary in the Mexican labor market in November 2023 according to the OCC website.

1.3 Research Question ¶

Which data job category has the highest salary in the Mexican labor market in November 2023 according to the OCC website?

1.4 Hypothesis ¶

The Data Architect position has the highest salary in the Mexican labor market in November 2023 according to the OCC website.

1.5 Abridged Methodology ¶

The methodology of the present study is based on Rollin’s Foundational Methodology for Data Science (Rollins, 2015).

  1. Analytical approach: Descriptive and inferential statistics.
  2. Data requirements: Data about job positions such as job name, salary, company and location.
  3. Data collection: Data was collected from the OCC Website (Mexico) on 26 November 2023, through web scraping with Python 3 and its libraries Selenium, BeautifulSoup, and Regex.
  4. Data exploration: Data was explored with Python 3 and its libraries Pandas and Seaborn.
  5. Data preparation: Data then was cleaned and wrangled with Python 3 and its libraries Numpy and Pandas.
  6. Data analysis and visualization: Data was analyzed with Python 3 and its libraries Pandas, Scipy and Statsmodels and visualized with Matplotlib, Seaborn, Folium and Plotly.
  7. Statistical analysis: The D'Agostino-Pearson normality test was used to assess the normality of the data jobs salary distribution. Then, both parametric (ANOVA, Tukey-Kramer, one-sample and two-sample T-tests) and non-parametric (Kruskal-Wallis H, Dunn, Wilcoxon signed-rank, and Mann-Whitney U) tests were carried out to assess the significance of the obtained results. Furthermore, an effect size analysis was carried out by computing the absolute mean salary difference, the percentage difference, the Cohen’s d, and the bootstrap confidence intervals for the mean for each data job category. This, in order to assess whether the salary differences were also significant from a practical point of view.
  8. Implementation: An interactive dashboard was built with Python 3 and its libraries Dash and Plotly, and deployed on Render.

Furthermore:

  1. A final report was written with the complete results obtained from the data.
  2. Some slides were prepared with the most important insights from the report.

The full project can be found at the GitHub repository.

1.6 Notebook's Goal ¶

In this context, the purpose of the present notebook is to perform the process of data exploration, data preparation, data analysis, data visualization as well as the statistical analysis.

In [ ]:
# Loading Requirements Text File
!pip install -r requirements.txt
In [ ]:
# Libraries installation
# !pip install seaborn
# !pip install folium
# !pip install plotly
# !pip install scipy
# !pip install statsmodels
# !pip install -U kaleido
# !pip install nbformat
# !pip install scikit_posthocs 
In [ ]:
# Libraries importation
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
from matplotlib_inline.backend_inline import set_matplotlib_formats
import seaborn as sns
import folium
from folium.plugins import MarkerCluster
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.io as pio
import scipy.stats as stats
import statsmodels.api as sm
import statsmodels.formula.api as smf
from scikit_posthocs import posthoc_tukey, posthoc_dunn
from sklearn.utils import resample
import re
import os
In [ ]:
# Matplolib settings

# Setting theme and plot resolution
sns.set_theme(context = 'notebook', style = 'darkgrid')
mpl.rcParams["figure.dpi"] = 100
mpl.rcParams["savefig.dpi"] = 300
set_matplotlib_formats('svg')

# Setting default plot's aesthetics
plotfontcolor = 'dimgray'
mpl.rcParams['text.color'] = plotfontcolor
mpl.rcParams['axes.labelcolor'] = plotfontcolor
mpl.rcParams['xtick.color'] = plotfontcolor
mpl.rcParams['ytick.color'] = plotfontcolor
mpl.rcParams["font.size"] = 10
mpl.rcParams['axes.titlesize'] = 14
mpl.rcParams['axes.labelsize'] = 12
plt.rcParams['xtick.labelsize'] = 10
plt.rcParams['ytick.labelsize'] = 10
plt.rcParams['legend.fontsize'] = 11
mpl.rcParams["axes.labelweight"] = "bold"
mpl.rcParams["axes.titleweight"] = "bold"
#mpl.rcParams['font.family'] = 'sans-serif'
#mpl.rcParams['font.family'] = 'serif'
In [ ]:
# Plotly settings

# Setting Notebook's output rendering
pio.renderers.default = "svg"

# Configuration for exporting Plotly charts for publication
config = {
  'toImageButtonOptions': {
    'format': 'png', # one of png, svg, jpeg, webp
    'filename': 'Plot',
    'scale': 2 # Multiply title/legend/axis/canvas sizes by this factor
  }
}

# Setting Font Properties for Plotly
font_px = dict(size=14, color ='dimgray')

# Setting Legend aesthetics for Plotly
legend_px = dict(bgcolor='#f0f0f0',
                  bordercolor='#cbcccd',
                  borderwidth=1.5)

# Setting Heatmap Color Scale for Plotly
heatmap_px_colorscale = [(0, "#000000"), 
                        (0.03, px.colors.sequential.Blues[8]), 
                        (0.25, px.colors.sequential.Blues[6]), 
                        (0.5, px.colors.sequential.Blues[4]), 
                        (0.75, px.colors.sequential.Blues[2]), 
                        (1, "aliceblue")]

heatmap_px_categoryarray = ['BI Analyst', 
                            'Business Analyst', 
                            'Data Analyst', 
                            'Data Architect', 
                            'Data Engineer', 
                            'Data Scientist',
                            'ML Engineer']

2. Data Collection ¶


As shown in the previous notebook, and in accordance with the defined data requirements: job name, salary, company and location; the data was collected by performing a web scraping process on the the OCC Website (Mexico) on 26 November 2022. Python 3 and its libraries Selenium and BeautifulSoup were used in this process.

To do so, key words were used (in both English and Spanish) for encompassing the above-mentioned data jobs:

jobs_list = ["analista datos",
           "data analyst",
           "cientifico datos",
           "data scientist",
           "ingeniero datos",
           "data engineer",
           "arquitecto datos",
           "data architect",
           "analista negocio",
           "business analyst",
           "bi analyst",
           "business intelligence",
           "aprendizaje automatico",
           "machine learning"]

Furthermore, the most important challenge faced during the step of data collection were the dynamic class ids in the OCC website. Indeed, it was found that the class ids for the salary, location and company tags changed every few minutes. Thus, regular expressions were used to scrape as much as possible data from the html. Please refer to the notebook for all the details.

The collected dataset was stored in GitHub repository from this project.

In [ ]:
# Data importation
df = pd.read_csv('https://raw.githubusercontent.com/DanielEduardoLopez/DataJobs-Nov2023-MX/main/Dataset_raw.csv')

3. Data Exploration ¶


After the data collected from the web scraping process was loaded to a Pandas dataframe. Then, the dataset was explored with the Pandas and Seaborn libraries to gain some understanding and preliminary insights.

3.1 Data Description ¶

In [ ]:
# Preview of the dataset
df.head(10)
Out[ ]:
Job Salary Company Location
0 Analista de datos $20,000 - $25,000 Mensual SKANDIA OPERADORA DE F... Polanco II Sección, Miguel Hidalgo, CDMX
1 Analista de Datos Sueldo no mostrado por la empresa GENERAL DE SEGUROS CDMX
2 Analista de datos MDM $17,000 - $20,000 Mensual Grupo Alnus S.A. de C.V. Monterrey, N.L.
3 ANALISTA DE DATOS ECOMMERCE $16,000 Mensual Grupo Daisa Cuauhtémoc, CDMX
4 Analista de datos maestros $10,000 Mensual Empresa confidencial Monterrey, N.L.
5 Coordinador analista de datos Sueldo no mostrado por la empresa Empresa confidencial Miguel Hidalgo, CDMX
6 ANALISTA DE DATOS / EXPERTO EN EXCEL $11,000 - $13,000 Mensual Empresa confidencial Zapopan, Jal.
7 Analista de Datos en Azure (6 meses) $55,000 - $60,000 Mensual SYGNO Monterrey, N.L.
8 Analista de Datos (Power Bi-Tableau) $29,000 Mensual Empresa confidencial CDMX
9 Analista de base de datos $25,000 - $27,000 Mensual SERVICIOS DE ENERGIA E... Monterrey, N.L.
In [ ]:
df.tail(10)
Out[ ]:
Job Salary Company Location
3812 Trabajo Desde Casa Desarrollador .NET Backend ... Sueldo no mostrado por la empresa BAIRESDEV LLC / (Tax I... Guadalajara, Jal.
3813 Work From Home Junior QA Automation / R+D/ Ref... Sueldo no mostrado por la empresa BAIRESDEV LLC / (Tax I... Guadalajara, Jal.
3814 Work From Home .NET + Angular Architect / R + ... Sueldo no mostrado por la empresa BAIRESDEV LLC / (Tax I... Guadalajara, Jal.
3815 Work From Home SemiSenior Angular Developer / ... Sueldo no mostrado por la empresa BAIRESDEV LLC / (Tax I... Guadalajara, Jal.
3816 Work From Home QA Automation Developer / R + D... Sueldo no mostrado por la empresa BAIRESDEV LLC / (Tax I... Guadalajara, Jal.
3817 Work From Home Python Tech Lead / Research + D... Sueldo no mostrado por la empresa BAIRESDEV LLC / (Tax I... Guadalajara, Jal.
3818 Trabajo Desde Casa Desarrollador Junior/MidLev... Sueldo no mostrado por la empresa BAIRESDEV LLC / (Tax I... Guadalajara, Jal.
3819 Work From Home Angular Architect / R+D / Ref. ... Sueldo no mostrado por la empresa BAIRESDEV LLC / (Tax I... Guadalajara, Jal.
3820 Work From Home Angular Tech Lead / R+D / Ref. ... Sueldo no mostrado por la empresa BAIRESDEV LLC / (Tax I... Guadalajara, Jal.
3821 Work From Home Senior Angular Developer / R+D ... Sueldo no mostrado por la empresa BAIRESDEV LLC / (Tax I... Guadalajara, Jal.
In [ ]:
df.columns
Out[ ]:
Index(['Job', 'Salary', 'Company', 'Location'], dtype='object')

In accordance with the set data requirements, the dataset contains four columns: Job, Salary, Company, and Location.

It can be observed that data lacks consistency in every variable. Some observations are in uppercase, salary values are strings, locations descriptions comprise the state and sometimes include the city, etc.

Furthermore, it is also clear that some jobs does not conform with any of the previously defined data jobs, so, the dataset must be cleaned appropriately.

Nonetheless, some further exploration will be carried out to develop a more complete strategy for doing so.

In [ ]:
# Dataset size
df.shape
Out[ ]:
(3822, 4)

So, we have 3822 original observations in total.

In [ ]:
# Basic dataset info
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3822 entries, 0 to 3821
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Job       3822 non-null   object
 1   Salary    3822 non-null   object
 2   Company   3822 non-null   object
 3   Location  3606 non-null   object
dtypes: object(4)
memory usage: 119.6+ KB

Firstly, all of the columns are strings. This is expectable for the Job, Company, and Location variables; but no for Salary, which must be appropriately wrangled.

From the results above it also can be seen that the variables Job, Salary and Location do not exhibit null values; whereas Location do. Indeed, in a few cases, the vacancy did not disclose the working location.

In [ ]:
# Basic dataset description
df.describe()
Out[ ]:
Job Salary Company Location
count 3822 3822 3822 3606
unique 924 249 419 145
top Data Scientist Sueldo no mostrado por la empresa Empresa confidencial CDMX
freq 70 2346 534 1076

From the results above, it is noteworthy that Data Scientist is the most common job in the dataset; whereas most of the vacancies correspond to confidential salaries and companies.

In addition to that, there were 924 different job titles, which should be conformed to the 7 defined data jobs above:

  • Business Analysts (BA)
  • Data Analysts (DA)
  • Data Architects (DR)
  • Data Engineers (DE)
  • Data Scientists (DS)
  • Business Intelligence Analysts (BI)
  • Machine Learning Engineers (ML)

3.2 Data Quality ¶

Missing values ¶

In the present section, the dataset was assessed to identify the existence of missing values in the dataset,

In [ ]:
# Function to calculate the percentage of missing values for each column in the dataset

def missing_values_table(df):
        """
        Calculates the number of missing values and its corresponding percentage of total values 
        for each attribute in a pandas dataframe. 

        Parameters

        df: Dataset in the format of a Pandas dataframe

        Returns

        mis_val_table: Table with the missing values percentage for each attribute

        """        
        mis_val = df.isnull().sum()
        mis_val_percent = (mis_val / len(df)) * 100
        mis_val_table = pd.concat([pd.Series(mis_val.index), pd.Series(mis_val.values), 
                                   pd.Series(mis_val_percent.values)], axis=1)
        mis_val_table = mis_val_table.rename(
        columns = {0: 'Attribute', 1 : 'Missing Values', 2: '% of Total Values'})
        mis_val_table = mis_val_table[
            mis_val_table.iloc[:,1] != 0].sort_values(
        '% of Total Values', ascending=False).round(1)
        print ("Dataset has " + str(df.shape[1]) + " columns.\n"      
            "There are " + str(mis_val_table.shape[0]) + " attributes that have missing values.")
        
        return mis_val_table
In [ ]:
missing_values_table(df)
Dataset has 4 columns.
There are 1 attributes that have missing values.
Out[ ]:
Attribute Missing Values % of Total Values
3 Location 216 5.7

Thus, about 5.7% of the observations in the location variable are null.

In view of the low percent of missing values, it is possible to safely remove the observed vacancies that do not state a working location. However, these vacancies might correspond to remote working locations, so, it has preferible to impute the missing values with "Remote/NA".

Data Completeness ¶

In this section, it was assessed whether the dataset cover all the cases required, which, in this case, are observations for each data job.

In [ ]:
# Job list definition

jobs_list = ["analista de datos",
           "data analyst",
           "cientifico de datos",
           "data scientist",
           "ingeniero de datos",
           "data engineer",
           "arquitecto de datos",
           "data architect",
           "analista de negocio",
           "business analyst",
           "bi analyst",
           "business intelligence",
           "aprendizaje automatico",
           "machine learning"]
In [ ]:
# Assessment of completeness

completeness = pd.DataFrame({'Job': 0, 'True Ocurrences': 0, 'False Ocurrences': 0}, index=[0])

for job in jobs_list:   
    bool_series = df['Job'].str.lower().isin([job]).value_counts()
    false_ocur = bool_series.values[0]   

    try: 
        true_ocur = bool_series.values[1]        
    
    except:
        true_ocur = 0
    
    new_row = pd.DataFrame({'Job': job, 'True Ocurrences': true_ocur, 'False Ocurrences': false_ocur}, index=[0])
    completeness = pd.concat([completeness, new_row]).reset_index(drop=True)
    
completeness = completeness.iloc[1:]
completeness
Out[ ]:
Job True Ocurrences False Ocurrences
1 analista de datos 120 3702
2 data analyst 48 3774
3 cientifico de datos 8 3814
4 data scientist 78 3744
5 ingeniero de datos 66 3756
6 data engineer 88 3734
7 arquitecto de datos 0 3822
8 data architect 22 3800
9 analista de negocio 0 3822
10 business analyst 54 3768
11 bi analyst 10 3812
12 business intelligence 12 3810
13 aprendizaje automatico 0 3822
14 machine learning 0 3822

So, taking into account the job titles in both English and Spanish, there are observations for almost all data jobs except for the position *Machine Learning Engineer, which is not the ideal scenario, but it is enough to proceed with the present analysis.

Incorrect Data Types ¶

In this section, the data types for each attribute and observation were assessed to identify incorrect data types.

In [ ]:
# Setting columns by data type
string_cols = ['Job','Company', 'Location']
numerical_cols = ['Salary']

# regex pattern to identify a number
number_pattern = "^\d+\.?\d*$" 
In [ ]:
# Creating empty lists
numbers_list = []
incorrect_numeric_list = []
incorrect_string_list = []

# Detecting incorrect data types in expected string columns
for col in df[string_cols].columns: 
  i = 0
  for observation in df[string_cols][col].values:
    if re.findall(number_pattern, str(observation)):
      incorrect_string_list.append({"Attribute": col, "Index": i, "Incorrect Datatype Observation": observation, 'Expected Datatype': 'string'})
    i += 1

# Detecting incorrect data types in expected numerical columns
for col in df[numerical_cols].columns: 
  i = 0
  for observation in df[numerical_cols][col].values:
    if re.findall(number_pattern, str(observation)):
      numbers_list.append(observation)
    if observation not in numbers_list:
      incorrect_numeric_list.append({"Attribute": col, "Index": i, "Incorrect Datatype Observation": observation, 'Expected Datatype': 'float'})
    i += 1

# Displaying results as a dataframe
incorrect_data_types = pd.DataFrame(data = (incorrect_string_list + incorrect_numeric_list))
incorrect_data_types
Out[ ]:
Attribute Index Incorrect Datatype Observation Expected Datatype
0 Salary 0 $20,000 - $25,000 Mensual float
1 Salary 1 Sueldo no mostrado por la empresa float
2 Salary 2 $17,000 - $20,000 Mensual float
3 Salary 3 $16,000 Mensual float
4 Salary 4 $10,000 Mensual float
... ... ... ... ...
3817 Salary 3817 Sueldo no mostrado por la empresa float
3818 Salary 3818 Sueldo no mostrado por la empresa float
3819 Salary 3819 Sueldo no mostrado por la empresa float
3820 Salary 3820 Sueldo no mostrado por la empresa float
3821 Salary 3821 Sueldo no mostrado por la empresa float

3822 rows × 4 columns

As noted above, the Salary attribute exhibited an incorrect string data type for all the observations. Thus, the non-numeric characters should be removed and the ranges should be transformed into an average.

3.3 Exploratory Data Analysis ¶

In this section, the data was explored.

In [ ]:
# Value counts for each job position
df[['Job']].value_counts()
Out[ ]:
Job                     
Data Scientist              70
Analista de Datos           62
Data engineer               58
Ingeniero de datos          46
Analista de datos           40
                            ..
CARRIER ANALYST              2
CATMAN - BI                  2
CHOFER MENSAJERO             2
CHOFER REFACCIONES           2
Data Quality Sr. Analyst     2
Name: count, Length: 924, dtype: int64

According to the result above, Data Scientist is the most demanded data job. However, it is noteworthy that Analista de datos (Spanish for Data Analyst) appears several times in the list with different combinations of upper and lower case. So, before to visually exploring the dataset, it is a good idea to homologate all the observations to lower case.

932 different positions titles are present in the data set, while only 7 are required!

In [ ]:
# Visual exploration of the Job variable (top 20)
fig, ax = plt.subplots(figsize = (7, 7))
sns.barplot(x = df['Job'].str.lower().value_counts().values[0:20], 
            y = df['Job'].str.lower().value_counts().index[0:20], 
            hue = df['Job'].value_counts().index[0:20],            
            palette="Blues_r",             
            legend = False,
            alpha = 0.9)
plt.xlabel('Frequency')
plt.ylabel('Job title')
plt.title('Most Common Data Jobs in Dataset')
plt.show()
2024-01-17T10:13:22.088790 image/svg+xml Matplotlib v3.7.4, https://matplotlib.org/

It appears that Analista de datos (Spanish for Data Analyst) is the most demanded position, followed by Data engineer and Data scientist.

Notwithstanding the above, it is necessary to homologate the job titles that are in English and Spanish; as well as homologating the different variations for the same job title.

On the other hand, other job titles appear in the data set that are errors or do not correspond to data jobs.

In [ ]:
# Visual exploration of the Job variable
fig, ax = plt.subplots(figsize = (10, 7))
sns.scatterplot(x = df['Job'].str.lower().unique(), 
                y = df['Job'].str.lower().value_counts(), 
                color="Green", alpha = 0.9)
plt.xticks([])
plt.xlabel('Job Title')
plt.ylabel('Frequency')
plt.title('Job Titles Frequency in Dataset')
plt.show()
2024-01-17T10:13:24.195789 image/svg+xml Matplotlib v3.7.4, https://matplotlib.org/

The plot above shows that the largest part of the job titles retrieved from the web scraping are unique values. Thus, this suggests that the job titles variable is fairly inconsistent and/or it exhibits a lot of other positions which are not of interest for this study.

In [ ]:
# Value counts for each job position
df[['Location']].value_counts()
Out[ ]:
Location                         
CDMX                                 1076
Miguel Hidalgo, CDMX                  406
Guadalajara, Jal.                     392
Monterrey, N.L.                       238
Cuauhtémoc, CDMX                       72
                                     ... 
Lomas Altas, Miguel Hidalgo, CDMX       2
Lerma, Edo. Méx.​                       2
La Paz, BCS.                            2
La Magdalena Contreras, CDMX            2
Mexicali, BC.                           2
Name: count, Length: 145, dtype: int64
In [ ]:
# Visual exploration of the Location variable (top 20)
fig, ax = plt.subplots(figsize = (7, 7))
sns.barplot(x = df['Location'].value_counts().values[0:20], 
            y = df['Location'].value_counts().index[0:20], 
            hue = df['Location'].value_counts().index[0:20], 
            legend = False,
            palette="Blues_r", 
            alpha = 0.9)
plt.xlabel('Frequency')
plt.ylabel('Location')
plt.title('Most Common Locations in Dataset')
plt.show()
2024-01-17T10:13:24.840790 image/svg+xml Matplotlib v3.7.4, https://matplotlib.org/

As shown in the plot above, in some cases, only the name of the state appears while in others the name of the municipality is also included. So, a further cleaning is necessary to map all the locations to their corresponding Mexican states.

Notwithstanding with the above, it is clear that CDMX (Mexico City) is the location where the most data jobs are demanded.

In [ ]:
# Value counts for each job position
df[['Company']].value_counts()
Out[ ]:
Company                  
Empresa confidencial         534
BAIRESDEV LLC / (Tax I...    524
Banamex                      204
Adecco                        60
Manpower, S.A. de C.V.        56
                            ... 
ETAM SAPI DE CV                2
Raul Barragán de la Cruz       2
Red Acero                      2
Ricoh Mexicana S.A. de...      2
DATALOGIC SA de CV             2
Name: count, Length: 419, dtype: int64
In [ ]:
# Visual exploration of the Company variable (top 20)
fig, ax = plt.subplots(figsize = (7, 7))
sns.barplot(x = df['Company'].value_counts().values[0:20], 
            y = df['Company'].value_counts().index[0:20],
            hue =  df['Company'].value_counts().index[0:20],
            legend = False,
            palette="Blues_r", 
            alpha = 0.9)
plt.xlabel('Frequency')
plt.ylabel('Company')
plt.title('Most Common Companies in Dataset')
plt.show()
2024-01-17T10:13:25.500791 image/svg+xml Matplotlib v3.7.4, https://matplotlib.org/

In the dataset, most of the vacancies correspond to confidential companies; however, from those whose a hiring company is available, BAIRESDEV is the company with the highest number of data jobs vacancies. It is followed by Banamex, Adecco and Manpower.

In [ ]:
# Visual exploration of the Company variable
fig, ax = plt.subplots(figsize = (10, 7))
sns.scatterplot(x = df['Company'].value_counts().index, 
                y = df['Company'].value_counts().values, 
                hue = df['Company'].value_counts().values, 
                legend = False,
                color="Salmon", 
                alpha = 0.9)
plt.xticks([])
plt.xlabel('Company')
plt.ylabel('Frequency')
plt.title('Companies Frequency in Dataset')
plt.show()
2024-01-17T10:13:26.801790 image/svg+xml Matplotlib v3.7.4, https://matplotlib.org/

The plot above shows that most companies only offer one or two data-related positions.

In [ ]:
# Value counts for each Salary
df['Salary'].value_counts()
Out[ ]:
Salary
Sueldo no mostrado por la empresa    2346
$60,000 - $70,000  Mensual             38
$25,000 - $30,000  Mensual             36
$30,000 - $35,000  Mensual             30
$20,000 - $25,000  Mensual             28
                                     ... 
$11,000 - $11,300  Mensual              2
$10,400  Mensual                        2
$6,600  Mensual                         2
$6,999 - $7,000  Mensual                2
$18,000 - $25,000  Mensual              2
Name: count, Length: 249, dtype: int64
In [ ]:
# Visual exploration of the Company variable (top 20)
fig, ax = plt.subplots(figsize = (7, 7))
sns.barplot(x = df['Salary'].value_counts().values[0:20], 
            y = df['Salary'].value_counts().index[0:20], 
            hue = df['Salary'].value_counts().index[0:20], 
            legend = False,
            palette ="Blues_r", 
            alpha = 0.9)
plt.xlabel('Frequency')
plt.ylabel('Salary')
plt.title('Salaries in Dataset')
plt.show()
2024-01-17T10:13:28.093789 image/svg+xml Matplotlib v3.7.4, https://matplotlib.org/

Obviously, is necessary to clean the data in order to get more meaningful insights regarding the salary. However, it is noteworthy that most of the published vacancies have an undisclosed salary ("Sueldo no mostrado por la empresa").

For the vacancies that have a disclosed salary, a salary range is most commonly published rather than a single value. And, in this sense, it appears that a salary from 60,000 MXN to 70,000 MXN is the most common, which is probably not accurate due to the fact that the salary variable is a string instead of an actual number.

Of course, the periodicity of the payment ("Mensual": monthly) must be removed from all observations.


4. Data Preparation ¶


After the data was explored, some wrangling strategies to effectively clean the data were defined. In this sense, some of the procedures applied to the data were:

  • Drop of duplicates
  • Imputation of null values
  • Harmonization of terms
  • Trimming and cleaning of values
  • Slicing of observations
  • Split of columns
  • Altering of data types

From the raw data obtained through web scraping, it has been noticed that much of the vacancies published lack of salary data. However, those positions without said data might be useful for assessing the geographical distribution of the data jobs in Mexico as well as identiying which companies demand more data specialists even if they have not published a proposed salary.

Thus, two datasets were prepared:

  1. A dataset with all the vacancies (df).
  2. A dataset with only the vacancies with disclosed salary data (_salarydf).

4.1 Drop of the duplicates ¶

As some companies republish their vacancies in the website in order for their vacancies to appear in the first places of the search results, it deemed necessary to drop the duplicate positions.

In [ ]:
# Drop of duplicates
df = df.drop_duplicates()
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 1182 entries, 0 to 3541
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Job       1182 non-null   object
 1   Salary    1182 non-null   object
 2   Company   1182 non-null   object
 3   Location  1117 non-null   object
dtypes: object(4)
memory usage: 46.2+ KB
In [ ]:
# Resetting index
df = df.reset_index().drop(columns='index')
df.head(20)
Out[ ]:
Job Salary Company Location
0 Analista de datos $20,000 - $25,000 Mensual SKANDIA OPERADORA DE F... Polanco II Sección, Miguel Hidalgo, CDMX
1 Analista de Datos Sueldo no mostrado por la empresa GENERAL DE SEGUROS CDMX
2 Analista de datos MDM $17,000 - $20,000 Mensual Grupo Alnus S.A. de C.V. Monterrey, N.L.
3 ANALISTA DE DATOS ECOMMERCE $16,000 Mensual Grupo Daisa Cuauhtémoc, CDMX
4 Analista de datos maestros $10,000 Mensual Empresa confidencial Monterrey, N.L.
5 Coordinador analista de datos Sueldo no mostrado por la empresa Empresa confidencial Miguel Hidalgo, CDMX
6 ANALISTA DE DATOS / EXPERTO EN EXCEL $11,000 - $13,000 Mensual Empresa confidencial Zapopan, Jal.
7 Analista de Datos en Azure (6 meses) $55,000 - $60,000 Mensual SYGNO Monterrey, N.L.
8 Analista de Datos (Power Bi-Tableau) $29,000 Mensual Empresa confidencial CDMX
9 Analista de base de datos $25,000 - $27,000 Mensual SERVICIOS DE ENERGIA E... Monterrey, N.L.
10 Analista de Bases de Datos BI $28,000 - $29,000 Mensual Empresa confidencial CDMX
11 Analista de Gestión de Datos Jr. $11,000 - $11,500 Mensual Pepsico División Foods Hermosillo, Son.
12 Data Analyst Sueldo no mostrado por la empresa Michael Page NaN
13 Data Analyst $60,000 - $65,000 Mensual Alia San José Insurgentes, Benito Juárez, CDMX
14 Data analyst $30,000 - $35,000 Mensual Empresa confidencial Miguel Hidalgo, CDMX
15 Data analyst Sueldo no mostrado por la empresa Carl Zeiss de México, ... CDMX
16 Portfolio Data Analyst Sueldo no mostrado por la empresa Banamex CDMX
17 Data Warehouse Sr. Consultant Sueldo no mostrado por la empresa Banamex CDMX
18 ANALISTA Y SOPORTE TI $15,000 - $20,000 Mensual Empresa confidencial Cuajimalpa de Morelos, CDMX
19 GESTOR DE DATOS DE PRUEBA JR Sueldo no mostrado por la empresa Getecsa Miguel Hidalgo, CDMX

4.2 Cleaning of the Attribute Job ¶

This section shows the process carried out to harmonize the job titles of the data-related positions and the drop of non data-related jobs.

To do so, a data jobs dictionary using keys in both English and Spanish was defined. If a certain vacancy comprised all the terms in the keys, then, the job title was renamed to a standard job title.

As a backup, a copy of the original job titles were kept in another column for comparison purposes.

In [ ]:
# Creating a new column for keeping the original job titles
df['Original Job Title'] = df['Job']
df = df[['Original Job Title', 'Job', 'Salary', 'Company', 'Location']]
df.head()
Out[ ]:
Original Job Title Job Salary Company Location
0 Analista de datos Analista de datos $20,000 - $25,000 Mensual SKANDIA OPERADORA DE F... Polanco II Sección, Miguel Hidalgo, CDMX
1 Analista de Datos Analista de Datos Sueldo no mostrado por la empresa GENERAL DE SEGUROS CDMX
2 Analista de datos MDM Analista de datos MDM $17,000 - $20,000 Mensual Grupo Alnus S.A. de C.V. Monterrey, N.L.
3 ANALISTA DE DATOS ECOMMERCE ANALISTA DE DATOS ECOMMERCE $16,000 Mensual Grupo Daisa Cuauhtémoc, CDMX
4 Analista de datos maestros Analista de datos maestros $10,000 Mensual Empresa confidencial Monterrey, N.L.
In [ ]:
# Creating job dictionary
job_dict = {
    ('data', 'analyst'): 'Data Analyst',
    ('analista', 'datos'): 'Data Analyst',

    ('data', 'scientist'): 'Data Scientist',
    ('data', 'science'): 'Data Scientist',
    ('científico', 'datos'): 'Data Scientist',
    ('cientifico', 'datos'): 'Data Scientist',

    ('data', 'engineer'): 'Data Engineer',
    ('ingeniero', 'datos'): 'Data Engineer',
    
    ('data', 'architect'): 'Data Architect',
    ('arquitecto', 'datos'): 'Data Architect',

    ('business', 'intelligence'): 'BI Analyst',
    ('bi', 'analyst'): 'BI Analyst',
    ('analista', 'bi'): 'BI Analyst',
    ('inteligencia', 'negocios'): 'BI Analyst',

    ('business', 'analyst'): 'Business Analyst',
    ('analista', 'negocio'): 'Business Analyst',

    ('machine', 'learning'): 'ML Engineer',
    ('aprendizaje' , 'automatico'): 'ML Engineer'
    
}
In [ ]:
# Proof of concept
list(job_dict.keys())[0][0] in 'data analyst' and list(job_dict.keys())[0][1] in 'data analyst'
Out[ ]:
True
In [ ]:
# Conversion of job observations to lower case
df['Job'] = df['Job'].str.casefold()
In [ ]:
# Cleaning of the variable Job
for i in range(len(job_dict)):
    df['Job'] = df['Job'].map(lambda x: list(job_dict.values())[i] if (list(job_dict.keys())[i][0] in x and list(job_dict.keys())[i][1] in x) else x)

df.head(10)
Out[ ]:
Original Job Title Job Salary Company Location
0 Analista de datos Data Analyst $20,000 - $25,000 Mensual SKANDIA OPERADORA DE F... Polanco II Sección, Miguel Hidalgo, CDMX
1 Analista de Datos Data Analyst Sueldo no mostrado por la empresa GENERAL DE SEGUROS CDMX
2 Analista de datos MDM Data Analyst $17,000 - $20,000 Mensual Grupo Alnus S.A. de C.V. Monterrey, N.L.
3 ANALISTA DE DATOS ECOMMERCE Data Analyst $16,000 Mensual Grupo Daisa Cuauhtémoc, CDMX
4 Analista de datos maestros Data Analyst $10,000 Mensual Empresa confidencial Monterrey, N.L.
5 Coordinador analista de datos Data Analyst Sueldo no mostrado por la empresa Empresa confidencial Miguel Hidalgo, CDMX
6 ANALISTA DE DATOS / EXPERTO EN EXCEL Data Analyst $11,000 - $13,000 Mensual Empresa confidencial Zapopan, Jal.
7 Analista de Datos en Azure (6 meses) Data Analyst $55,000 - $60,000 Mensual SYGNO Monterrey, N.L.
8 Analista de Datos (Power Bi-Tableau) Data Analyst $29,000 Mensual Empresa confidencial CDMX
9 Analista de base de datos Data Analyst $25,000 - $27,000 Mensual SERVICIOS DE ENERGIA E... Monterrey, N.L.
In [ ]:
# Visual exploration of the Job variable (Top 20 positions)
number_positions = 20
fig, ax = plt.subplots(figsize = (10, 7))
sns.barplot(x=df['Job'].value_counts()[:number_positions], 
            y=df['Job'].value_counts().keys()[:number_positions], 
            hue=df['Job'].value_counts().keys()[:number_positions],
            legend=False,
            palette="Blues_r", 
            alpha= 0.9)
plt.title('Top 20 Job Titles Frequency')
plt.xlabel('Frequency')
plt.ylabel('Job Title')
plt.show()
2024-01-17T10:13:29.346835 image/svg+xml Matplotlib v3.7.4, https://matplotlib.org/

In the plot above it is possible to observe:

  1. The job titles for the data-related positions have been successfully harmonized and they conform the majority of the observations in the dataset
  2. Along with the data positions, there is significant number of other non data-related positions. So, the latter will be removed from the dataset.
In [ ]:
# Drop of observations not corresponding to Data Jobs
df = df.loc[(df['Job'] == 'Data Analyst') 
            | (df['Job'] == 'Business Analyst') 
            | (df['Job'] == 'Data Engineer') 
            | (df['Job'] == 'Data Scientist') 
            | (df['Job'] == 'Data Architect')
            | (df['Job'] == 'BI Analyst')
            | (df['Job'] == 'ML Engineer')]
df['Job'].value_counts()
Out[ ]:
Job
Data Analyst        186
Business Analyst     93
Data Engineer        92
Data Scientist       88
BI Analyst           70
Data Architect       29
ML Engineer           5
Name: count, dtype: int64
In [ ]:
# Resulting number of observations
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 563 entries, 0 to 1181
Data columns (total 5 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   Original Job Title  563 non-null    object
 1   Job                 563 non-null    object
 2   Salary              563 non-null    object
 3   Company             563 non-null    object
 4   Location            518 non-null    object
dtypes: object(5)
memory usage: 26.4+ KB
In [ ]:
# Visual exploration of the cleaned Job variable 
fig, ax = plt.subplots(figsize = (8, 5))
sns.barplot(x=df['Job'].value_counts()[:7], 
            y=df['Job'].value_counts().keys()[:7], 
            hue=df['Job'].value_counts().keys()[:7], 
            legend=False,
            palette="Blues_r", 
            alpha=0.7)
plt.title('Frequency per Job Title')
plt.xlabel('Frequency')
plt.ylabel('Job Title')
plt.show()
2024-01-17T10:13:29.974790 image/svg+xml Matplotlib v3.7.4, https://matplotlib.org/

From the plot above, it is possible to see that the variable Job has been successfully cleaned as data-related jobs are the only observations in the data set and their job titles are consistent. It is also noteworthy that Data Analyst is the most demanded position.

4.3 Cleaning of the Attribute Company ¶

This section shows the process carried out to impute the null values for a string legend of "Confidential" and to provide consistency to the letter case in the company names of the retrieved data.

Moreover, whenever the company is published in the vacancy as "Empresa Confidencial"(Spanish for confidential company), it was changed to "Confidential" too.

In [ ]:
# Cleaning of the Company variable
df['Company'] = df['Company'].apply(lambda x: "Confidential" if pd.isnull(x) else x)
df['Company'] = df['Company'].apply(lambda x: str.title(str(x)))
df['Company'] = df['Company'].apply(lambda x: "Confidential" if x == "Empresa Confidencial" else x)
df.head(10)
Out[ ]:
Original Job Title Job Salary Company Location
0 Analista de datos Data Analyst $20,000 - $25,000 Mensual Skandia Operadora De F... Polanco II Sección, Miguel Hidalgo, CDMX
1 Analista de Datos Data Analyst Sueldo no mostrado por la empresa General De Seguros CDMX
2 Analista de datos MDM Data Analyst $17,000 - $20,000 Mensual Grupo Alnus S.A. De C.V. Monterrey, N.L.
3 ANALISTA DE DATOS ECOMMERCE Data Analyst $16,000 Mensual Grupo Daisa Cuauhtémoc, CDMX
4 Analista de datos maestros Data Analyst $10,000 Mensual Confidential Monterrey, N.L.
5 Coordinador analista de datos Data Analyst Sueldo no mostrado por la empresa Confidential Miguel Hidalgo, CDMX
6 ANALISTA DE DATOS / EXPERTO EN EXCEL Data Analyst $11,000 - $13,000 Mensual Confidential Zapopan, Jal.
7 Analista de Datos en Azure (6 meses) Data Analyst $55,000 - $60,000 Mensual Sygno Monterrey, N.L.
8 Analista de Datos (Power Bi-Tableau) Data Analyst $29,000 Mensual Confidential CDMX
9 Analista de base de datos Data Analyst $25,000 - $27,000 Mensual Servicios De Energia E... Monterrey, N.L.
In [ ]:
# Visual exploration of the cleaned Company variable
fig, ax = plt.subplots(figsize = (8, 6))
sns.barplot(x=df['Company'].value_counts()[:15], 
            y=df['Company'].value_counts().keys()[:15],
            hue=df['Company'].value_counts().keys()[:15],
            legend=False, 
            palette="Blues_r", 
            alpha = 0.8)
plt.title('Top 15 Companies Offering Data Jobs')
plt.xlabel('Frequency')
plt.ylabel('Company')
plt.show()
2024-01-17T10:13:30.770790 image/svg+xml Matplotlib v3.7.4, https://matplotlib.org/

The plot above suggests that the letter case of the different company names is now consistent, and all the vacancies with confidential companies have been harmonized as Confidential. So, it is deemed that the variable Company has been successfully cleaned.

In this sense, it is noteworthy that most of the vacancies for data jobs correspond to confidential companies. However, from those publications whose the hiring company is not confidential, the top company demanding data jobs in Mexico is Bairesdev.

4.4 Cleaning of the attribute Location ¶

This section shows the process carried out to harmonize the name of the Mexican States in which the data-related positions are demanded.

To do so, firstly, the state component was extracted from each location observation.

In [ ]:
# Extraction of the state from the location data
df['Location'] = df['Location'].apply(lambda x: "Remote/NA" if pd.isnull(x) else x)
df['Location'] = df['Location'].apply(lambda x: str(x).split(",")[-1] if len(str(x).split(",")) > 1 else x)
df['Location'] = df['Location'].apply(lambda x: str(x.replace('\u200b', ' ')).strip())
df.head(10)
Out[ ]:
Original Job Title Job Salary Company Location
0 Analista de datos Data Analyst $20,000 - $25,000 Mensual Skandia Operadora De F... CDMX
1 Analista de Datos Data Analyst Sueldo no mostrado por la empresa General De Seguros CDMX
2 Analista de datos MDM Data Analyst $17,000 - $20,000 Mensual Grupo Alnus S.A. De C.V. N.L.
3 ANALISTA DE DATOS ECOMMERCE Data Analyst $16,000 Mensual Grupo Daisa CDMX
4 Analista de datos maestros Data Analyst $10,000 Mensual Confidential N.L.
5 Coordinador analista de datos Data Analyst Sueldo no mostrado por la empresa Confidential CDMX
6 ANALISTA DE DATOS / EXPERTO EN EXCEL Data Analyst $11,000 - $13,000 Mensual Confidential Jal.
7 Analista de Datos en Azure (6 meses) Data Analyst $55,000 - $60,000 Mensual Sygno N.L.
8 Analista de Datos (Power Bi-Tableau) Data Analyst $29,000 Mensual Confidential CDMX
9 Analista de base de datos Data Analyst $25,000 - $27,000 Mensual Servicios De Energia E... N.L.

Then, a dictionary was defined using the abbreviations in the keys and the full state names in the values. After that, the dictionary was passed to the data in order to replace the abbreviations by the full state names.

In [ ]:
# Dictionary with the retrieved Location values
location_dict = {
    'CDMX': 'Ciudad de México',
    'Chih.': 'Chihuahua',
    'Edo. Méx.': 'Estado de México',
    'Gto.': 'Guanajuato',
    'Jal.': 'Jalisco',
    'Q. Roo': 'Quintana Roo',
    'N. L.': 'Nuevo León',
    'N.L.': 'Nuevo León',
    'Pue.': 'Puebla',
    'México': 'Estado de México',
    'Zac.': 'Zacatecas',
    'Tamps.': 'Tamaulipas',
    'Mor.': 'Morelos',
    'Sin.': 'Sinaloa',
    'Oax.': 'Oaxaca',
    'Qro.': 'Querétaro',
    'Mich.': 'Michoacán',
    'Son.': 'Sonora',
    'BC.': 'Baja California',
    'SLP.': 'San Luis Potosí',
    'Yuc.': 'Yucatán',
    'Coah.': 'Coahuila',
    'BCS.': 'Baja California Sur',
    'Nay.': 'Nayarit',
    'Ags.': 'Aguascalientes',
    'Hgo.': 'Hidalgo',
    'Chis.': 'Chiapas',
    'Ver.': 'Veracruz',
    'Tab.': 'Tabasco'
    }
In [ ]:
# Cleaning of the Location variable in the dataset
for i in range(len(location_dict)):
  df['Location'] = df['Location'].apply(lambda x: list(location_dict.values())[i] if x == list(location_dict.keys())[i] else x)

df.head(10)
Out[ ]:
Original Job Title Job Salary Company Location
0 Analista de datos Data Analyst $20,000 - $25,000 Mensual Skandia Operadora De F... Ciudad de México
1 Analista de Datos Data Analyst Sueldo no mostrado por la empresa General De Seguros Ciudad de México
2 Analista de datos MDM Data Analyst $17,000 - $20,000 Mensual Grupo Alnus S.A. De C.V. Nuevo León
3 ANALISTA DE DATOS ECOMMERCE Data Analyst $16,000 Mensual Grupo Daisa Ciudad de México
4 Analista de datos maestros Data Analyst $10,000 Mensual Confidential Nuevo León
5 Coordinador analista de datos Data Analyst Sueldo no mostrado por la empresa Confidential Ciudad de México
6 ANALISTA DE DATOS / EXPERTO EN EXCEL Data Analyst $11,000 - $13,000 Mensual Confidential Jalisco
7 Analista de Datos en Azure (6 meses) Data Analyst $55,000 - $60,000 Mensual Sygno Nuevo León
8 Analista de Datos (Power Bi-Tableau) Data Analyst $29,000 Mensual Confidential Ciudad de México
9 Analista de base de datos Data Analyst $25,000 - $27,000 Mensual Servicios De Energia E... Nuevo León
In [ ]:
# Visual exploration of the cleaned Location variable
fig, ax = plt.subplots(figsize = (5, 8))
sns.barplot(x=df['Location'].value_counts().values, 
            y=df['Location'].value_counts().index, 
            hue=df['Location'].value_counts().index, 
            legend=False,
            palette="Spectral_r", 
            alpha = 0.9)
plt.title('Frequency per Mexican State')
plt.xlabel('Frequency')
plt.ylabel('State')
plt.show()
2024-01-17T10:13:31.679790 image/svg+xml Matplotlib v3.7.4, https://matplotlib.org/

The plot above shows that the names of the Mexican states where the data job vacancies are demanded is now consistent. So, the variable Location has been successfully cleaned.

Moreover, it is noteworthy that Ciudad de México (Mexico City) is the location where the most of the data jobs are demanded.

4.5 Cleaning of the Attribute Salary ¶

This section shows the process carried out to wrangle and clean the salary data. To do so, the salary observations were cleaned from useless characters, then, the salary ranges were splitted into two columns and a average salary was calculated when applicable. Finally, non useful Salary columns were dropped.

In [ ]:
# Checking if all the Salary observations are disclosed in a monthly basis
df[df['Salary'].apply(lambda x: True if ('Mensual' not in str(x)) and (x is not np.nan) else False) & (df['Salary'] != 'Sueldo no mostrado por la empresa')]
Out[ ]:
Original Job Title Job Salary Company Location

Thus, all of the observations correspond to monthly salaries.

In [ ]:
# Replacing the wording of the positions without disclosed salaries with NaN values
df['Salary'] = df['Salary'].apply(lambda x: np.nan if x == 'Sueldo no mostrado por la empresa' else x)
df.head()
Out[ ]:
Original Job Title Job Salary Company Location
0 Analista de datos Data Analyst $20,000 - $25,000 Mensual Skandia Operadora De F... Ciudad de México
1 Analista de Datos Data Analyst NaN General De Seguros Ciudad de México
2 Analista de datos MDM Data Analyst $17,000 - $20,000 Mensual Grupo Alnus S.A. De C.V. Nuevo León
3 ANALISTA DE DATOS ECOMMERCE Data Analyst $16,000 Mensual Grupo Daisa Ciudad de México
4 Analista de datos maestros Data Analyst $10,000 Mensual Confidential Nuevo León
In [ ]:
# Removing useless characters from column values
char_remove = {
                'Anual': '',
                'Mensual': '',
                '$': '',
                ',': ''}

for key, value in char_remove.items():
    df['Salary'] = df['Salary'].str.replace(key, value).str.strip()

df.head()
Out[ ]:
Original Job Title Job Salary Company Location
0 Analista de datos Data Analyst 20000 - 25000 Skandia Operadora De F... Ciudad de México
1 Analista de Datos Data Analyst NaN General De Seguros Ciudad de México
2 Analista de datos MDM Data Analyst 17000 - 20000 Grupo Alnus S.A. De C.V. Nuevo León
3 ANALISTA DE DATOS ECOMMERCE Data Analyst 16000 Grupo Daisa Ciudad de México
4 Analista de datos maestros Data Analyst 10000 Confidential Nuevo León
In [ ]:
# Splitting Salary column values into two columns
df[['Min Salary', 'Max Salary']] = df['Salary'].str.split('-', expand=True)
df['Min Salary'] = df['Min Salary'].str.strip().astype(np.float64)
df['Max Salary'] = df['Max Salary'].str.strip().astype(np.float64)
df.head()
Out[ ]:
Original Job Title Job Salary Company Location Min Salary Max Salary
0 Analista de datos Data Analyst 20000 - 25000 Skandia Operadora De F... Ciudad de México 20000.0 25000.0
1 Analista de Datos Data Analyst NaN General De Seguros Ciudad de México NaN NaN
2 Analista de datos MDM Data Analyst 17000 - 20000 Grupo Alnus S.A. De C.V. Nuevo León 17000.0 20000.0
3 ANALISTA DE DATOS ECOMMERCE Data Analyst 16000 Grupo Daisa Ciudad de México 16000.0 NaN
4 Analista de datos maestros Data Analyst 10000 Confidential Nuevo León 10000.0 NaN
In [ ]:
# Calculating average salary for each vacancy
df['Avg Salary'] = (df['Min Salary'] + df['Max Salary']) / 2
df.head()
Out[ ]:
Original Job Title Job Salary Company Location Min Salary Max Salary Avg Salary
0 Analista de datos Data Analyst 20000 - 25000 Skandia Operadora De F... Ciudad de México 20000.0 25000.0 22500.0
1 Analista de Datos Data Analyst NaN General De Seguros Ciudad de México NaN NaN NaN
2 Analista de datos MDM Data Analyst 17000 - 20000 Grupo Alnus S.A. De C.V. Nuevo León 17000.0 20000.0 18500.0
3 ANALISTA DE DATOS ECOMMERCE Data Analyst 16000 Grupo Daisa Ciudad de México 16000.0 NaN NaN
4 Analista de datos maestros Data Analyst 10000 Confidential Nuevo León 10000.0 NaN NaN

It seems that some salary data is only in the Min Salary column. Those observations correspond to the vacancies in which the published salary was not a range but a single value. Thus, said value was copied into the Max Salary and Avg Salary columns.

In [ ]:
# Imputing NaN values in the Avg Salary column with the values in the Min Salary column
mask = pd.isna(df['Avg Salary']) & ~pd.isna(df['Min Salary'])
df.loc[mask, 'Avg Salary'] = df['Min Salary'] # Using loc to avoid the SettingWithCopyWarning
df.loc[mask, 'Max Salary'] = df['Min Salary'] # Using loc to avoid the SettingWithCopyWarning
df.head()
Out[ ]:
Original Job Title Job Salary Company Location Min Salary Max Salary Avg Salary
0 Analista de datos Data Analyst 20000 - 25000 Skandia Operadora De F... Ciudad de México 20000.0 25000.0 22500.0
1 Analista de Datos Data Analyst NaN General De Seguros Ciudad de México NaN NaN NaN
2 Analista de datos MDM Data Analyst 17000 - 20000 Grupo Alnus S.A. De C.V. Nuevo León 17000.0 20000.0 18500.0
3 ANALISTA DE DATOS ECOMMERCE Data Analyst 16000 Grupo Daisa Ciudad de México 16000.0 16000.0 16000.0
4 Analista de datos maestros Data Analyst 10000 Confidential Nuevo León 10000.0 10000.0 10000.0
In [ ]:
# Visual exploration of the cleaned Salary variable
fig, ax = plt.subplots(figsize = (10, 7))
sns.histplot(data=df['Avg Salary'], 
             color=sns.color_palette('Blues_r')[0], 
             alpha=0.6)
plt.title('Salary Distribution')
plt.xlabel('Salary')
plt.ylabel('Frequency')
plt.show()
2024-01-17T10:13:32.417790 image/svg+xml Matplotlib v3.7.4, https://matplotlib.org/

The histogram above suggests that all Salary observations have been transformed from a range in a string format into actual numbers. So, it is considered that the variable Salary has been successfully cleaned. On the other hand, it is noteworthy that some observations fall well beyond the main salary distribution. Let's take a look into those observations.

In [ ]:
# Exploring positions whose salary is more than $70,000 MXN per month
df[df['Avg Salary'] > 70000]
Out[ ]:
Original Job Title Job Salary Company Location Min Salary Max Salary Avg Salary
48 Analista de datos sr. Data Analyst 70000 - 80000 Confidential Ciudad de México 70000.0 80000.0 75000.0
52 Analista de datos Sr. Data Analyst 80000 Confidential Ciudad de México 80000.0 80000.0 80000.0
135 SQL Server Data Analyst Data Analyst 70000 - 85000 Caspex Corp Remote/NA 70000.0 85000.0 77500.0
199 Científico de Datos Data Scientist 90000 Ecosistemex S. De R.L.... Ciudad de México 90000.0 90000.0 90000.0
225 Sr Data Scientist Data Scientist 70000 - 80000 Enterprise Solutions, ... Ciudad de México 70000.0 80000.0 75000.0
281 Sr Data Engineer Data Engineer 75000 - 80000 Addon Technologies Inc. Jalisco 75000.0 80000.0 77500.0
427 Data engineer (Databricks) Data Engineer 80000 - 90000 Manpower, S.A. De C.V. Estado de México 80000.0 90000.0 85000.0
428 Data Engineer (Databricks) Data Engineer 80000 - 90000 Manpower, S.A. De C.V. Estado de México 80000.0 90000.0 85000.0
448 Data Engineer with Fivetran developer Data Engineer 70000 - 80000 Enterprise Solutions, ... Remote/NA 70000.0 80000.0 75000.0
479 Data Architect Data Architect 70000 - 80000 Softtek Remote/NA 70000.0 80000.0 75000.0

From the above results, it is plausible that Data Engineer and Data Architect positions earns an extremely high salaries as the ones showed as such positions require advanced programming skills and expertise. However, it is interesting that some Sr. Data Analyst and Sr. Data Scientists positions made it to the top.

Finally, the original Salary column was dropped.

In [ ]:
# Dropping of the original Salary column
df = df.drop(columns = ['Salary'])
df.head()
Out[ ]:
Original Job Title Job Company Location Min Salary Max Salary Avg Salary
0 Analista de datos Data Analyst Skandia Operadora De F... Ciudad de México 20000.0 25000.0 22500.0
1 Analista de Datos Data Analyst General De Seguros Ciudad de México NaN NaN NaN
2 Analista de datos MDM Data Analyst Grupo Alnus S.A. De C.V. Nuevo León 17000.0 20000.0 18500.0
3 ANALISTA DE DATOS ECOMMERCE Data Analyst Grupo Daisa Ciudad de México 16000.0 16000.0 16000.0
4 Analista de datos maestros Data Analyst Confidential Nuevo León 10000.0 10000.0 10000.0
In [ ]:
df.describe()
Out[ ]:
Min Salary Max Salary Avg Salary
count 197.000000 197.000000 197.000000
mean 30114.121827 34211.827411 32162.974619
std 18295.173638 20662.664676 19417.175845
min 4000.000000 4000.000000 4000.000000
25% 16000.000000 19000.000000 17500.000000
50% 25000.000000 30000.000000 27500.000000
75% 40000.000000 45000.000000 42000.000000
max 90000.000000 90000.000000 90000.000000
In [ ]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 563 entries, 0 to 1181
Data columns (total 7 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Original Job Title  563 non-null    object 
 1   Job                 563 non-null    object 
 2   Company             563 non-null    object 
 3   Location            563 non-null    object 
 4   Min Salary          197 non-null    float64
 5   Max Salary          197 non-null    float64
 6   Avg Salary          197 non-null    float64
dtypes: float64(3), object(4)
memory usage: 51.4+ KB

So, the dataset has been successfully cleaned and prepared for the Data Analysis & Visualization.

4.6 Processed Dataset Exporting ¶

The processed dataset was exported to CSV.

In [ ]:
# Exporting the processed dataset
df.to_csv('Dataset_processed.csv', index=False, encoding='utf-8')

4.7 Salary Dataset ¶

The second dataframe with only those vacancies whose salary have been disclosed was built in the present section.

In [ ]:
# Second dataframe with only disclosed salary data
salary_df = df.dropna(axis = 0, how='any', subset = ['Avg Salary'])
salary_df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 197 entries, 0 to 1067
Data columns (total 7 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Original Job Title  197 non-null    object 
 1   Job                 197 non-null    object 
 2   Company             197 non-null    object 
 3   Location            197 non-null    object 
 4   Min Salary          197 non-null    float64
 5   Max Salary          197 non-null    float64
 6   Avg Salary          197 non-null    float64
dtypes: float64(3), object(4)
memory usage: 12.3+ KB
In [ ]:
salary_df.describe()
Out[ ]:
Min Salary Max Salary Avg Salary
count 197.000000 197.000000 197.000000
mean 30114.121827 34211.827411 32162.974619
std 18295.173638 20662.664676 19417.175845
min 4000.000000 4000.000000 4000.000000
25% 16000.000000 19000.000000 17500.000000
50% 25000.000000 30000.000000 27500.000000
75% 40000.000000 45000.000000 42000.000000
max 90000.000000 90000.000000 90000.000000
In [ ]:
salary_df.head(10)
Out[ ]:
Original Job Title Job Company Location Min Salary Max Salary Avg Salary
0 Analista de datos Data Analyst Skandia Operadora De F... Ciudad de México 20000.0 25000.0 22500.0
2 Analista de datos MDM Data Analyst Grupo Alnus S.A. De C.V. Nuevo León 17000.0 20000.0 18500.0
3 ANALISTA DE DATOS ECOMMERCE Data Analyst Grupo Daisa Ciudad de México 16000.0 16000.0 16000.0
4 Analista de datos maestros Data Analyst Confidential Nuevo León 10000.0 10000.0 10000.0
6 ANALISTA DE DATOS / EXPERTO EN EXCEL Data Analyst Confidential Jalisco 11000.0 13000.0 12000.0
7 Analista de Datos en Azure (6 meses) Data Analyst Sygno Nuevo León 55000.0 60000.0 57500.0
8 Analista de Datos (Power Bi-Tableau) Data Analyst Confidential Ciudad de México 29000.0 29000.0 29000.0
9 Analista de base de datos Data Analyst Servicios De Energia E... Nuevo León 25000.0 27000.0 26000.0
10 Analista de Bases de Datos BI Data Analyst Confidential Ciudad de México 28000.0 29000.0 28500.0
11 Analista de Gestión de Datos Jr. Data Analyst Pepsico División Foods Sonora 11000.0 11500.0 11250.0

5. Data Analysis & Visualization ¶


After the dataset was cleaned and prepared, the data was analyzed and visualized in order to answer the following questions:

  • What is the most demanded data job?
  • Where do the most data jobs locate?
  • How is the demand per state by data job category?
  • What are the companies demanding more data jobs?
  • How is the data jobs demand per company?
  • Where do the companies demanding data jobs locate?
  • How many salary observations there are for each data job category?
  • Which data job category has the highest salaries?
  • Where the highest salaries can be found?
  • What companies offer the highest salaries?
  • What companies offer the highest salaries per data job category?

To answer these questions, different plots and charts were drawn using several libraries such as Matplotlib, Seaborn, Folium, and Plotly, with the purpose of testing them and selecting for reporting the most pleasant and aesthetic results.

On the other hand, a specific directory named "Figures" was used to store the relevant plots created during the analysis.

In [ ]:
if not os.path.exists("Figures"):
    os.mkdir("Figures")

5.1 What is the most demanded data job? ¶

To answer this question, a donut chart was plotted using the resulting series from the value_counts method.

The idea was to show the different data jobs categories as the arcs in the donut chart, in which the long of the arcs represents the demand in the Mexican job market.

The donut charts were drawn using both Matplotlib and Plotly.

In [ ]:
# Color palette for pie charts
pie_colors = ['#154360','#539ecd','#89bedc',"#a9cce3", "#d4e6f1",'#dbe9f6', "#ebf5fb"]
In [ ]:
# Donut chart of the data jobs demand with Matplotlib
fig, ax = plt.subplots(figsize = (8, 13))
explode = [0.02] * 7
wedges, texts, autotexts = ax.pie(x = list(df['Job'].value_counts().values), 
                                  wedgeprops=dict(edgecolor='w', linewidth= 1),
                                  textprops=dict(size=17, weight="bold", color = 'dimgray'), 
                                  colors=pie_colors,                                  
                                  autopct='%.0f%%', 
                                  pctdistance=1.15, 
                                  startangle = 90, 
                                  counterclock = False, 
                                  explode = explode)
center = plt.Circle( (0,0), 0.7, color='white')
p = plt.gcf()
p.gca().add_artist(center)
plt.legend(wedges, 
            list(df['Job'].value_counts().keys()), 
            fontsize=13, 
            #loc="center right", 
            bbox_to_anchor=(1, 0, 0.5, 0.75))
plt.tight_layout()
plt.savefig('Figures/Fig1_DemandOfDataJobsPerCategory.png',  bbox_inches = 'tight')
plt.show()
2024-01-17T10:13:34.248789 image/svg+xml Matplotlib v3.7.4, https://matplotlib.org/
In [ ]:
# Donut chart of the data jobs demand with Plotly

job_df = pd.DataFrame(df['Job'].value_counts().reset_index().rename(columns = {'count': 'Count'}))

fig = px.pie(job_df, 
             values='Count', 
             names='Job', 
             color='Job', 
             hole=0.7,  
             color_discrete_sequence=px.colors.sequential.Blues_r,
             height=500,
             width=700,
             title='<b>Demand of Data Jobs Per Category</b>')
fig.update_layout(title_x=0.5, 
                  font=font_px,
                  legend = legend_px                  
                  )
fig.update_traces(hoverinfo='label+percent+name',
                  textfont_size=18,
                  textinfo='percent',
                  texttemplate='%{percent:.0%}',
                  direction ='clockwise',
                  marker=dict(colors=pie_colors, line=dict(color='white', width=1)))
fig.write_image("Figures/Fig1_DemandOfDataJobsPerCategory2.png", scale=2)
fig.write_image("Figures/Fig1_DemandOfDataJobsPerCategory2.svg", scale=2)
fig.show(config=config)
33%17%16%16%12%5%1%Data AnalystBusiness AnalystData EngineerData ScientistBI AnalystData ArchitectML EngineerDemand of Data Jobs Per Category

From the above plot, it is possible to conclude that a third of the data jobs (33%) correspond to positions of Data Analyst, rendering them as the most demanded ones in the Mexican labor market at the time of this study. In second place, it is possible to found the positions of Business Analyst with about 17% of the total.

Furthermore, it is noteworthy that Data Engineer and Data Scientists positions are more demanded than Business Intelligence Analyst positions with about 16%, about 16% and about 12% of the demand in the country, respectively.

On the contrary, Data Architect and Machine Learning Engineers positions are the less demanded, with only about 5% and about 1% out of the total, respectively.

5.2 Where do the most data jobs locate? ¶

To answer this question, a choropleth was prepared using Folium and Plotly, in which the intensity of the color was proportional to the demand of data jobs per state.

First, to define the specific location of the markers in the Folium map, a dictionary was defined with the corresponding ID, Latitude and Longitude of the capital cities for each Mexican State.

The IDs correspond to the IDs of the file with the geographical data (Arroyo-Velázquez, 2022).

Then, the dictionary was converted into a Pandas dataframe.

In [ ]:
# Mexican states dictionary with corresponding ID, Latitude and Longitude

states_dict = {'Aguascalientes': ('AS', 21.87945992,	-102.2904135),
                  'Baja California': ('BC', 32.663214,-115.4903741),
                  'Baja California Sur': ('BS', 24.1584937,-110.315928),
                  'Campeche': ('CC', 19.8450352,-90.5381231),
                  'Chiapas': ('CS', 16.7541485,-93.119001),
                  'Chihuahua': ('CH', 28.6349557,-106.0777049),
                  'Coahuila': ('CL', 25.4286965,-100.9994484),
                  'Colima': ('CM', 19.2408324,-103.7291389),
                  'Ciudad de México': ('DF', 19.4335493,-99.1344048),
                  'Durango': ('DG', 24.0241017,-104.6708325),
                  'Guanajuato': ('GT', 21.0176446,-101.2586863),
                  'Guerrero': ('GR', 17.5516921,-99.5025877),
                  'Hidalgo': ('HG', 20.1183855,-98.7540094),
                  'Jalisco': ('JC', 20.6773775,-103.3494204),
                  'Estado de México': ('MC', 19.289191,-99.6670425),
                  'Michoacán': ('MN', 19.7030535,-101.1937953),
                  'Morelos': ('MS', 18.9218499,-99.2353856),
                  'Nayarit': ('NT', 21.5122308,-104.8948845),
                  'Nuevo León': ('NL', 25.6717637,-100.3163831),
                  'Oaxaca': 	('OC', 17.0617935,-96.7271634),
                  'Puebla': ('PL', 19.0428817,-98.2002919),
                  'Querétaro': ('QT', 20.37998212,	-100.0000308),
                  'Quintana Roo': ('QR', 18.4978052,-88.3029951),
                  'San Luis Potosí': ('SP', 22.1521646,-100.9765552),
                  'Sinaloa': ('SL', 24.8082702,-107.3945828),
                  'Sonora': ('SR', 29.0748734,-110.9597578),
                  'Tabasco': ('TC', 17.9882632,-92.9209807),
                  'Tamaulipas': ('TS', 23.7312703,-99.1517694),
                  'Tlaxcala': ('TL', 19.3171271,-98.2386354),
                  'Veracruz': ('VZ', 19.5269375,-96.92401),
                  'Yucatán': ('YN', 20.9664386,-89.623114),
                  'Zacatecas': ('ZS', 22.7753476,-102.5740002)}
                
states_df = pd.DataFrame.from_dict(states_dict, orient='index').reset_index().\
                    rename(columns={"index": "State", 0: "ID", 1: "Lat", 2: "Long"}).set_index('State')
states_df.head()
Out[ ]:
ID Lat Long
State
Aguascalientes AS 21.879460 -102.290413
Baja California BC 32.663214 -115.490374
Baja California Sur BS 24.158494 -110.315928
Campeche CC 19.845035 -90.538123
Chiapas CS 16.754148 -93.119001

After that, with the purpose of providing a color to each Mexican state in the choropleth, the percentage of data jobs for each state was calculated using the value_counts method. For those states without data jobs, a join with the previously defined states_df dataframe was performed and the NaN values were filled with zeros.

In [ ]:
# Calculating the Demand Percentage per State
demand_by_state_df = pd.DataFrame(df['Location'].value_counts())
total = sum(demand_by_state_df['count'])
demand_by_state_df['Percentage'] = (demand_by_state_df['count']) / total *100
demand_by_state_df = demand_by_state_df.reset_index().rename(columns={"Location": "State", "count": "Count"})
demand_by_state_df = states_df.merge(demand_by_state_df, left_on='State', right_on='State', how = 'outer').fillna(0)
demand_by_state_df
Out[ ]:
State ID Lat Long Count Percentage
0 Aguascalientes AS 21.879460 -102.290413 4.0 0.710480
1 Baja California BC 32.663214 -115.490374 4.0 0.710480
2 Baja California Sur BS 24.158494 -110.315928 1.0 0.177620
3 Campeche CC 19.845035 -90.538123 0.0 0.000000
4 Chiapas CS 16.754148 -93.119001 0.0 0.000000
5 Chihuahua CH 28.634956 -106.077705 11.0 1.953819
6 Coahuila CL 25.428697 -100.999448 0.0 0.000000
7 Colima CM 19.240832 -103.729139 0.0 0.000000
8 Ciudad de México DF 19.433549 -99.134405 320.0 56.838366
9 Durango DG 24.024102 -104.670833 0.0 0.000000
10 Guanajuato GT 21.017645 -101.258686 5.0 0.888099
11 Guerrero GR 17.551692 -99.502588 0.0 0.000000
12 Hidalgo HG 20.118385 -98.754009 0.0 0.000000
13 Jalisco JC 20.677377 -103.349420 51.0 9.058615
14 Estado de México MC 19.289191 -99.667042 22.0 3.907638
15 Michoacán MN 19.703053 -101.193795 1.0 0.177620
16 Morelos MS 18.921850 -99.235386 0.0 0.000000
17 Nayarit NT 21.512231 -104.894885 1.0 0.177620
18 Nuevo León NL 25.671764 -100.316383 51.0 9.058615
19 Oaxaca OC 17.061794 -96.727163 2.0 0.355240
20 Puebla PL 19.042882 -98.200292 6.0 1.065719
21 Querétaro QT 20.379982 -100.000031 11.0 1.953819
22 Quintana Roo QR 18.497805 -88.302995 0.0 0.000000
23 San Luis Potosí SP 22.152165 -100.976555 7.0 1.243339
24 Sinaloa SL 24.808270 -107.394583 6.0 1.065719
25 Sonora SR 29.074873 -110.959758 2.0 0.355240
26 Tabasco TC 17.988263 -92.920981 1.0 0.177620
27 Tamaulipas TS 23.731270 -99.151769 1.0 0.177620
28 Tlaxcala TL 19.317127 -98.238635 0.0 0.000000
29 Veracruz VZ 19.526937 -96.924010 2.0 0.355240
30 Yucatán YN 20.966439 -89.623114 9.0 1.598579
31 Zacatecas ZS 22.775348 -102.574000 0.0 0.000000
32 Remote/NA 0 0.000000 0.000000 45.0 7.992895

For the Folium map, a dictionary was defined to assign a specific color for each type of data job.

In [ ]:
# Dictionary to map job colors
job_colors = {
    'Data Analyst': 'blue',
    'Business Analyst': 'green',
    'Data Engineer': 'violet',
    'Data Scientist': 'red',
    'Data Architect': 'yellow',
    'BI Analyst': 'orange',
    'ML Engineer': 'black'
  }

Later, each vacancy was colored using the dictionary above to map job color.

In [ ]:
# Join of the job data dataframe with the Mexican States latitude and longitud
jobs_loc = df.merge(states_df, left_on='Location', right_on='State', how = 'inner')
jobs_loc['Marker Color'] = jobs_loc['Job'].map(job_colors)
jobs_loc.head()
Out[ ]:
Original Job Title Job Company Location Min Salary Max Salary Avg Salary ID Lat Long Marker Color
0 Analista de datos Data Analyst Skandia Operadora De F... Ciudad de México 20000.0 25000.0 22500.0 DF 19.433549 -99.134405 blue
1 Analista de Datos Data Analyst General De Seguros Ciudad de México NaN NaN NaN DF 19.433549 -99.134405 blue
2 ANALISTA DE DATOS ECOMMERCE Data Analyst Grupo Daisa Ciudad de México 16000.0 16000.0 16000.0 DF 19.433549 -99.134405 blue
3 Coordinador analista de datos Data Analyst Confidential Ciudad de México NaN NaN NaN DF 19.433549 -99.134405 blue
4 Analista de Datos (Power Bi-Tableau) Data Analyst Confidential Ciudad de México 29000.0 29000.0 29000.0 DF 19.433549 -99.134405 blue

Then, the choropleth using Folium was created, and the vacancy markers were added.

The JSON file with the geographical data was taken from Arroyo-Velázquez (2022).

In [ ]:
# Initialization of Folium map
m = folium.Map(location=[24,-105], zoom_start=5)

# Setting of the Choropleth
folium.Choropleth(
    geo_data = 'https://raw.githubusercontent.com/isaacarroyov/data_visualization_practice/master/Python/visualizing_mexican_wildfires_tds/data/states_mx.json',
    name = "Data Jobs Localization",
    data = demand_by_state_df,
    columns = ["ID", "Percentage"],
    key_on = "feature.id",
    fill_color = "Blues",
    fill_opacity = 0.7,
    line_opacity = 0.1,
    legend_name = "Data Jobs Demand (%)",
).add_to(m)

marker_cluster = MarkerCluster()
m.add_child(marker_cluster)

# Adding of individual markers to clusters
for index, record in jobs_loc.iterrows():
    marker = folium.Marker(list(record[8:10]),
                           icon = folium.Icon(color='white', icon_color=str(record[-1])))
    marker_cluster.add_child(marker)

# Setting of the Layer Control
folium.LayerControl().add_to(m)

# Showing the map
m
Out[ ]:
Make this Notebook Trusted to load map: File -> Trust Notebook

Finally, a choropleth map with Plotly was drawn using the same dataframe.

In [ ]:
# Choropleth map with Plotly

fig = px.choropleth(demand_by_state_df, 
                    geojson = 'https://raw.githubusercontent.com/isaacarroyov/data_visualization_practice/master/Python/visualizing_mexican_wildfires_tds/data/states_mx.json', 
                    locations='ID', 
                    color='Percentage',
                    color_continuous_scale="Blues",
                    scope="north america",
                    #title='Demand of Data Jobs per Mexican State',
                    labels={'Percentage': 'National<br>Demand %'},
                    height= 500,
                    width = 800                 
                    )
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0}, title_x=0.5, font=font_px)
fig.update_geos(fitbounds="locations", visible=False)
fig.write_image("Figures/Fig2_DemandOfDataJobsPerMexicanState.png", scale=2)
fig.write_image("Figures/Fig2_DemandOfDataJobsPerMexicanState.svg", scale=2)
fig.show(config=config)
01020304050NationalDemand %

In conclusion, the above maps strongly suggest that most of the data jobs are concentrated in the capital, Mexico City, with more than 50% of the demand at the moment of this study.

On the other hand, according to the data, Nuevo León, and Jalisco represent distant second places, with only about 10% of the total demand each.

5.3 How is the data jobs demand per location? ¶

To answer this question, a stacked bar plot and a heatmap were drawn.

First, a dataframe was prepared using the pivot_table function from Pandas using the location as the index. The vacancies without specific locations disclosed were removed.

The stacked bar plots were drawn using both Matplotlib and Plotly; whereas the heatmap were drawn using both Seaborn and Plotly.

In [ ]:
# Preparing dataframe of jobs per location
jobs_per_location_df = (df.pivot_table(index = 'Location', columns = 'Job', values = 'Company', aggfunc = 'count')
                    .fillna(0).drop("Remote/NA"))
jobs_per_location_df['Total'] = jobs_per_location_df.sum(axis=1, numeric_only= True)
jobs_per_location_df.head()
Out[ ]:
Job BI Analyst Business Analyst Data Analyst Data Architect Data Engineer Data Scientist ML Engineer Total
Location
Aguascalientes 0.0 2.0 1.0 0.0 1.0 0.0 0.0 4.0
Baja California 1.0 1.0 1.0 0.0 1.0 0.0 0.0 4.0
Baja California Sur 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0
Chihuahua 8.0 1.0 2.0 0.0 0.0 0.0 0.0 11.0
Ciudad de México 37.0 54.0 101.0 18.0 50.0 58.0 2.0 320.0

As most of the data jobs concentrates in few locations, the top 5 were used to create a new dataframe to be used to draw the bar plots.

In [ ]:
# Top 5 locations
top_locations = 5
jobs_per_location_bar_plot_df = (jobs_per_location_df.sort_values('Total', ascending = False)[:top_locations]
                                 .sort_values('Total', ascending = True).drop(columns = 'Total')
                                )
jobs_per_location_bar_plot_df.head()
Out[ ]:
Job BI Analyst Business Analyst Data Analyst Data Architect Data Engineer Data Scientist ML Engineer
Location
Querétaro 2.0 3.0 5.0 0.0 1.0 0.0 0.0
Estado de México 1.0 4.0 12.0 0.0 4.0 1.0 0.0
Nuevo León 5.0 6.0 22.0 0.0 11.0 7.0 0.0
Jalisco 5.0 10.0 16.0 3.0 8.0 6.0 3.0
Ciudad de México 37.0 54.0 101.0 18.0 50.0 58.0 2.0
In [ ]:
# Order of the data jobs according to demand
data_jobs_order = list(df['Job'].value_counts().keys())
data_jobs_order
Out[ ]:
['Data Analyst',
 'Business Analyst',
 'Data Engineer',
 'Data Scientist',
 'BI Analyst',
 'Data Architect',
 'ML Engineer']
In [ ]:
# Bar plot of Jobs by State with Matplotlib

# Stack bar chart of Data jobs per State
jobs_per_location_bar_plot_df.plot.barh(stacked=True, 
                                        y = data_jobs_order, 
                                        color=sns.color_palette('Blues_r',7),
                                        figsize=(8, 5),
                                        width = 0.85 )
plt.legend(facecolor = 'white', loc = 'center right', title = 'Data Job Category')
plt.xlabel('Vacancies', weight = 'bold')
plt.ylabel('Location', weight = 'bold')
plt.title('Demand Per Data Job Category In Top Locations\n')
plt.savefig('Figures/Fig3_DemandPerDataJobCategoryInTopLocations1.png',  bbox_inches = 'tight')
plt.show()
2024-01-17T10:14:21.155685 image/svg+xml Matplotlib v3.7.4, https://matplotlib.org/
In [ ]:
# Bar plot of Jobs by State with Plotly
fig = px.bar(jobs_per_location_bar_plot_df, 
            x=data_jobs_order, 
            y=jobs_per_location_bar_plot_df.index,
            color_discrete_sequence=px.colors.sequential.Blues_r,
            height = 500,
            width = 800,
            title = '<b>Demand Per Data Job Category In Top Locations</b>',
            labels = {"value": "<b>Vacancies</b>", 
                      "variable": "Data Job Category",
                      "Location":"<b>Location</b>"},
            barmode='stack'#'group'
            )
fig.update_layout(title_x=0.5, font=font_px, legend=legend_px)
fig.write_image("Figures/Fig3_DemandPerDataJobCategoryInTopLocations2.png", scale=2)
fig.write_image("Figures/Fig3_DemandPerDataJobCategoryInTopLocations2.svg", scale=2)
fig.show(config=config)
0100200300QuerétaroEstado de MéxicoNuevo LeónJaliscoCiudad de MéxicoData Job CategoryData AnalystBusiness AnalystData EngineerData ScientistBI AnalystData ArchitectML EngineerDemand Per Data Job Category In Top LocationsVacanciesLocation

To draw the heatmaps, a new dataframe was built using the previous dataframe joined with the dataframe with the states data. This, with the purpose to draw the heapmap with all the states from Mexico, and not only with those with current vacancies.

In [ ]:
# Building Dataframe for Heatmap
jobs_per_location_heatmap_df = (jobs_per_location_df.join(states_df, how = 'outer').fillna(0)
                                .drop(columns = ['Total','ID', 'Lat', 'Long'])
                                )
jobs_per_location_heatmap_df.head()
Out[ ]:
BI Analyst Business Analyst Data Analyst Data Architect Data Engineer Data Scientist ML Engineer
Aguascalientes 0.0 2.0 1.0 0.0 1.0 0.0 0.0
Baja California 1.0 1.0 1.0 0.0 1.0 0.0 0.0
Baja California Sur 0.0 0.0 1.0 0.0 0.0 0.0 0.0
Campeche 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Chiapas 0.0 0.0 0.0 0.0 0.0 0.0 0.0
In [ ]:
# Heatmap of Data Jobs per State with Seaborn
plt.figure(figsize = (8,10))
ax = sns.heatmap(jobs_per_location_heatmap_df, 
                cmap = "Blues_r",
                mask = (jobs_per_location_heatmap_df == 0))
ax.set_facecolor('black')
cbar = ax.collections[0].colorbar
cbar.ax.tick_params(labelsize=11)
plt.xlabel('Data Job Category')
plt.ylabel('Location')
plt.xticks(rotation = 330) 
plt.title('Demand Per Location & Data Job Category\n')
plt.grid(False)
plt.savefig('Figures/Fig4_DemandPerLocationAndDataJobCategory1.png',  bbox_inches = 'tight')
plt.show()
2024-01-17T10:14:24.790684 image/svg+xml Matplotlib v3.7.4, https://matplotlib.org/

Heatmaps with Plotly require the dataset to be in a tidy (or long) form. So, the data was transformed once again using the method melt from Pandas.

In [ ]:
# Converting the Dataset into a tidy format
jobs_per_location_heatmap_tidy_df = pd.melt(jobs_per_location_heatmap_df.reset_index().rename(columns = {'index': 'Location'}), 
                                            id_vars= 'Location', var_name = 'Job', value_name = 'Vacancies').\
                                    sort_values(by = 'Location', ascending = False)
jobs_per_location_heatmap_tidy_df.head(10)
Out[ ]:
Location Job Vacancies
223 Zacatecas ML Engineer 0.0
95 Zacatecas Data Analyst 0.0
63 Zacatecas Business Analyst 0.0
159 Zacatecas Data Engineer 0.0
127 Zacatecas Data Architect 0.0
191 Zacatecas Data Scientist 0.0
31 Zacatecas BI Analyst 0.0
158 Yucatán Data Engineer 1.0
62 Yucatán Business Analyst 3.0
126 Yucatán Data Architect 0.0
In [ ]:
# Heatmap of Data Jobs per State with Plotly

# Heatmap with Plotly
fig = px.density_heatmap(jobs_per_location_heatmap_tidy_df, 
                          y='Location', 
                          x = 'Job', 
                          z = 'Vacancies',
                          color_continuous_scale= heatmap_px_colorscale,
                          #color_continuous_scale="Blues_r",
                          height=1000,
                          width=800, 
                          title= '<b>Demand Per Location & Data Job Category</b>',
                          labels={"Job": "<b>Data Job Category</b>",
                                  "Location":"<b>Location</b>"},
                          )
fig.update_layout(title_x=0.5, coloraxis_colorbar=dict(title="<b>Vacancies</b>"), 
                  font=font_px)
fig.update_xaxes(categoryorder='array', categoryarray= heatmap_px_categoryarray)
fig.write_image("Figures/Fig4_DemandPerLocationAndDataJobCategory2.png", scale=2)
fig.write_image("Figures/Fig4_DemandPerLocationAndDataJobCategory2.svg", scale=2)
fig.show(config=config)
BI AnalystBusiness AnalystData AnalystData ArchitectData EngineerData ScientistML EngineerZacatecasYucatánVeracruzTlaxcalaTamaulipasTabascoSonoraSinaloaSan Luis PotosíQuintana RooQuerétaroPueblaOaxacaNuevo LeónNayaritMorelosMichoacánJaliscoHidalgoGuerreroGuanajuatoEstado de MéxicoDurangoColimaCoahuilaCiudad de MéxicoChihuahuaChiapasCampecheBaja California SurBaja CaliforniaAguascalientes020406080100VacanciesDemand Per Location & Data Job CategoryData Job CategoryLocation

From the plots above, it is possible to observe that, undoubtedly, Mexico City, Jalisco and Nuevo León are the locations where most of the data jobs are demanded, while the rest of the country is lagging behind in terms of data jobs creation.

However, it is important to remark that the data jobs demand is largely concentrated in Mexico City, whereas Jalisco, Nuevo León, and Estado de México are distant second, third and fourth places, respectively.

Moreover, it is noteworthy that the Data Analyst position is the one most demanded across the Mexican States; whereas Data Architect and ML Engineer are the less demanded, as they are mostly concentrated in Mexico City and Jalisco.

5.4 What are the companies demanding more data jobs? ¶

To answer this question, a treemap was drawn using Plotly in which the size of the squares was proportional to the number of vacancies published by a company.

So, first, a new dataframe was created using the groupby and count methods from Pandas. Then, the vacancies published under the Confidential umbrella were removed. After that, the datarame was sliced to get only the top 15 companies demanding data jobs.

Finally, the labels for the treemap were also sliced to shorten them, and provide the treemap with a more aesthetic outlook.

In [ ]:
# Building the Dataframe with the top 15 companies demanding data jobs
top_companies = 15
top_companies_df = (df.groupby(by = 'Company', as_index= False)['Job'].count()
                      .sort_values(by = 'Job', ascending = False)
                      .rename(columns = {'Job': 'Vacancies'})[:top_companies]
                      .set_index('Company').drop('Confidential').reset_index()
                      )
top_companies_df['Company'] = top_companies_df['Company'].apply(lambda x: x[:16])

top_companies_df.head()
Out[ ]:
Company Vacancies
0 Bairesdev Llc / 49
1 Banamex 20
2 Pepsico División 15
3 Softtek 9
4 Michael Page 8
In [ ]:
# Treemap with Plotly
fig = px.treemap(top_companies_df, 
                values='Vacancies',
                path = [px.Constant("."), 'Company'],  
                color = 'Vacancies', 
                color_continuous_scale=px.colors.sequential.Blues,
                title= f'<b>Top {top_companies} Companies Demanding Data Jobs</b>',
                height= 600,
                width = 1000
                )
fig.update_layout(title_x=0.5, 
                  coloraxis_colorbar=dict(title="<b>Vacancies</b>"), 
                  font=font_px)
fig.write_image("Figures/Fig5_TopCompaniesDemandingDataJobs.png", scale=2)
fig.write_image("Figures/Fig5_TopCompaniesDemandingDataJobs.svg", scale=2)
fig.show(config=config)
.Bairesdev Llc / BanamexPepsico DivisiónSofttekBancoppelCapgemini MéxicoMichael PageServicios Axity AdeccoManpower, S.A. DGrupo SalinasIds ComercialCognizant TechnoEy51015202530354045VacanciesTop 15 Companies Demanding Data Jobs

From the plot above, it is clear that Bairesdev, Banamex, Pepsico and Softek are the companies with the highest data jobs demand at the moment of the present study. So, even though, the data jobs demand may vary along time, the current interest of such companies in data science and analytics might convert them in interesting prospects for job seekers.

It is also noteworthy that most of the top companies demanding data jobs are recruitment agencies, tech consultancy firms, and banks.

5.5 How is the data jobs demand per company? ¶

To answer this question, a heatmap was drawn in order to show what data jobs categories are demanded by company, in which the lighter the color, the larger the demand.

As usual, a dataframe with the data for top 30 companies demanding data jobs was prepared using the method pivot_table from Pandas.

Of course, the vacancies without a disclosed company were removed from the dataset.

Heatmaps were drawn using both Seaborn and Plotly.

In [ ]:
# Dataframe for Heatmap with top 30 companies demanding data jobs
top_companies_data_jobs = 30
jobs_per_company_df = (pd.pivot_table(data = df, index = 'Company', columns = 'Job', 
                                     values = 'Location', aggfunc = 'count')
                      .fillna(0).reset_index())
jobs_per_company_df['Total'] = jobs_per_company_df.sum(axis=1, numeric_only= True)
jobs_per_company_df = (jobs_per_company_df.sort_values('Total', ascending = False)
                      .set_index('Company').drop('Confidential')[:top_companies_data_jobs]
                      .drop(columns = 'Total').sort_values('Company', ascending = True)
                      )
jobs_per_company_df.head()
Out[ ]:
Job BI Analyst Business Analyst Data Analyst Data Architect Data Engineer Data Scientist ML Engineer
Company
Adecco 1.0 0.0 3.0 0.0 3.0 0.0 0.0
Bairesdev Llc / (Tax I... 2.0 19.0 12.0 2.0 8.0 4.0 2.0
Banamex 1.0 6.0 10.0 0.0 0.0 3.0 0.0
Bancoppel 0.0 0.0 0.0 1.0 2.0 5.0 0.0
Banorte 1.0 0.0 2.0 0.0 0.0 2.0 0.0
In [ ]:
# Heatmap with Seaborn
plt.figure(figsize = (8,10))
formatter = mpl.ticker.StrMethodFormatter('{x:,.0f}')
ax = sns.heatmap(jobs_per_company_df, 
                cmap = "Blues_r",
                mask = (jobs_per_company_df == 0),
                cbar_kws={"format": formatter})
ax.set_facecolor('black')
cbar = ax.collections[0].colorbar
cbar.ax.tick_params(labelsize=11)
plt.xlabel('Data Job Category')
plt.ylabel(f'Company')
plt.xticks(rotation = 330)
plt.title(f'Demand Per Company (Top {top_companies_data_jobs}) & Data Job Category\n')
plt.grid(False)
plt.savefig('Figures/Fig6_DemandPerCompanyAndDataJobCategory.png',  bbox_inches = 'tight')
plt.show()
2024-01-17T10:14:28.581694 image/svg+xml Matplotlib v3.7.4, https://matplotlib.org/

Likewise, as heatmaps with Plotly require the dataset to be in a tidy (or long) form. The data was transformed once again using the method melt from Pandas.

In [ ]:
# Dataframe for Heatmap with top 30 companies demanding data jobs in tidy format
jobs_per_company_tidy_df = pd.melt(jobs_per_company_df.reset_index().sort_values('Company', ascending = False), 
                                  id_vars = 'Company', var_name = 'Job', value_name = 'Vacancies')
jobs_per_company_tidy_df.head()
Out[ ]:
Company Job Vacancies
0 Softtek BI Analyst 1.0
1 Servicios Axity Mexico... BI Analyst 0.0
2 Santander BI Analyst 0.0
3 Reclutamiento En Tecno... BI Analyst 0.0
4 Pepsico División Foods BI Analyst 1.0
In [ ]:
# Heatmap with Plotly
fig = px.density_heatmap(jobs_per_company_tidy_df, 
                         y='Company', 
                         x='Job', 
                         z='Vacancies',
                         color_continuous_scale= heatmap_px_colorscale,
                         height=1000,
                         width=900, 
                         title= f'<b>Demand Per Company (Top {top_companies_data_jobs}) & Data Job Category</b>',
                         labels={'Job': '<b>Data Job Category</b>',
                                  'Company': '<b>Company</b>'},
                          )
fig.update_layout(title_x=0.5, 
                  coloraxis_colorbar=dict(title="<b>Vacancies</b>"), 
                  font=font_px)
fig.update_xaxes(categoryorder='array', categoryarray=heatmap_px_categoryarray)
fig.write_image("Figures/Fig6_DemandPerCompanyAndDataJobCategory2.png", scale=2)
fig.write_image("Figures/Fig6_DemandPerCompanyAndDataJobCategory2.svg", scale=2)
fig.show(config=config)
BI AnalystBusiness AnalystData AnalystData ArchitectData EngineerData ScientistML EngineerSofttekServicios Axity Mexico...SantanderReclutamiento En Tecno...Pepsico División FoodsNueva Wal Mart De Mexi...Michael PageManpower, S.A. De C.V.LiverpoolKairós Digital Solutio...Johnson ControlsInacombIds ComercialHitssGrupo SalinasGrabjobsGonet, S.A. De C.V.EyEverisEnterprise Solutions, ...Consultoria Y Operacio...Cognizant Technology S...CitigroupCapgemini MéxicoBbva BancomerBanorteBancoppelBanamexBairesdev Llc / (Tax I...Adecco024681012141618VacanciesDemand Per Company (Top 30) & Data Job CategoryData Job CategoryCompany

From the plots above, it is possible to see that, indeed, Data Analyst and Data Engineer positions are the ones most demanded by the companies in Mexico at the moment of this study; whereas Data Architect and ML Engineer positions are the less demanded.

Notwithstanding with the above, the heatmap suggests that Data Analyst and Data Engineer positions are more demanded across different organizations. On the contrary, Data Scientist, Data Architect and, certainly, ML Engineers vacancies are demanded in more specific organizations like tech consulting companies and banks.

5.6 Where do the companies demanding data jobs locate? ¶

To answer this question, a heatmap was drawn in order to show the companies demanding data jobs and its most common locations. Likewise, the lighter the color, the greater the demand.

In this context, two dataframes were prepared. One with all the companies and its locations, and the second one with only the top 30 companies demanding data jobs. This, with the purpose of drawing two corresponding heatmaps.

The dataframe with the data for all the companies demanding data jobs was prepared by using the method pivot_table from Pandas. Then, the second dataframe was prepared by means of slicing. If the demand in a specific state did not exist, the data was filled with zeros. The vacancies without a disclosed company were removed from the dataset.

Finally, heatmaps were drawn using both Seaborn and Plotly.

In [ ]:
# Base dataframe for analyzing the location of the companies demanding data jobs
companies_by_location_df = (pd.pivot_table(data = df, index = 'Company', columns = 'Location', 
                           values = 'Job', aggfunc = 'count').transpose()
                          .join(states_df, how = 'outer').drop(columns =['ID', 'Lat', 'Long'])
                          .transpose().fillna(0).astype(float).drop('Confidential')
                          .assign(Total=lambda d: d.sum(1)).sort_values('Total', ascending = False)
                          .drop(columns = 'Total').reset_index().rename(columns = {'index': 'Company'})
                          .set_index('Company')                       
                          )     
companies_by_location_df.head()
Out[ ]:
Aguascalientes Baja California Baja California Sur Campeche Chiapas Chihuahua Ciudad de México Coahuila Colima Durango ... Remote/NA San Luis Potosí Sinaloa Sonora Tabasco Tamaulipas Tlaxcala Veracruz Yucatán Zacatecas
Company
Bairesdev Llc / (Tax I... 0.0 1.0 0.0 0.0 0.0 0.0 24.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
Banamex 0.0 0.0 0.0 0.0 0.0 0.0 20.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Pepsico División Foods 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 ... 9.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0
Softtek 1.0 0.0 0.0 0.0 0.0 0.0 3.0 0.0 0.0 0.0 ... 5.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Capgemini México 2.0 0.0 0.0 0.0 0.0 0.0 4.0 0.0 0.0 0.0 ... 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 33 columns

In [ ]:
# Dataframe with all companies and its locations
all_companies_by_location_df = (companies_by_location_df                                
                                .sort_index()
                                .transpose()
                            )
all_companies_by_location_df.head()
Out[ ]:
Company 5-18 Consulting Group ... Addon Technologies Inc. Adecco Adi Global Aerovías De México, Sa... Ait Vanguardia Tecnoló... Alia Aliat Universidades Alten Ingenieria Mexic... Arca Continental ... Universidad Latinoamer... Universidad Popular Au... Universidad Tecmilenio Volaris Waldo'S Dolar Mart De ... Wipro Technologies, S.... Work Cast Solutions Sa... Workable Ats Zegovia Rh Sa De Cv Zillow Group Inc.
Aguascalientes 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Baja California 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Baja California Sur 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Campeche 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Chiapas 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 255 columns

In [ ]:
# Top 30 companies demanding data jobs and its locations
top_companies_by_location = 30
top_companies_by_location_df = (companies_by_location_df[:top_companies_by_location]
                                .sort_index()
                                )
top_companies_by_location_df.head()
Out[ ]:
Aguascalientes Baja California Baja California Sur Campeche Chiapas Chihuahua Ciudad de México Coahuila Colima Durango ... Remote/NA San Luis Potosí Sinaloa Sonora Tabasco Tamaulipas Tlaxcala Veracruz Yucatán Zacatecas
Company
Adecco 0.0 0.0 0.0 0.0 0.0 0.0 6.0 0.0 0.0 0.0 ... 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Bairesdev Llc / (Tax I... 0.0 1.0 0.0 0.0 0.0 0.0 24.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
Banamex 0.0 0.0 0.0 0.0 0.0 0.0 20.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Bancoppel 0.0 0.0 0.0 0.0 0.0 0.0 5.0 0.0 0.0 0.0 ... 0.0 0.0 3.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Banorte 0.0 0.0 0.0 0.0 0.0 0.0 5.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 33 columns

In [ ]:
# Heatmap of the top 30 companies demanding data jobs and their locations in Seaborn
plt.figure(figsize = (12,9))
ax = sns.heatmap(top_companies_by_location_df, 
                cmap = "Blues_r",
                mask = (top_companies_by_location_df == 0))
ax.set_facecolor('xkcd:black')
cbar = ax.collections[0].colorbar
cbar.ax.tick_params(labelsize=12)
plt.xlabel('Location')
plt.ylabel(f'Company')
plt.title(f'Demand Per Company (Top {top_companies_by_location}) & Location\n')
plt.grid(False)
plt.savefig('Figures/Fig7_DemandPerCompanyAndLocationTop30.png',  bbox_inches = 'tight')
plt.show()
2024-01-17T10:14:32.480954 image/svg+xml Matplotlib v3.7.4, https://matplotlib.org/

Then, to use Plotly, the dataframe was converted into a tidy (or long) format.

In [ ]:
# Converting top 30 companies and their locations dataframe into tidy data
top_companies_by_location_tidy_df = pd.melt(top_companies_by_location_df.reset_index()
                                            .sort_values(by = 'Company', ascending = False), 
                                            id_vars = 'Company', var_name = 'Location', value_name = 'Vacancies')
top_companies_by_location_tidy_df.head()
Out[ ]:
Company Location Vacancies
0 Softtek Aguascalientes 1.0
1 Sociedad Nacional Prom... Aguascalientes 0.0
2 Servicios Axity Mexico... Aguascalientes 0.0
3 Santander Aguascalientes 0.0
4 Reclutamiento En Tecno... Aguascalientes 0.0
In [ ]:
# Heatmap of the top 30 companies demanding data jobs and their locations in Plotly
fig = px.density_heatmap(top_companies_by_location_tidy_df, 
                          y='Company', 
                          x='Location', 
                          z='Vacancies',
                          color_continuous_scale= heatmap_px_colorscale,
                          height=900,
                          width=1100, 
                          title= f'<b>Demand Per Company (Top {top_companies_by_location}) & Location</b>',
                          labels={'Location': '<b>Location</b>',
                                  'Company': '<b>Company</b>'},                      
                          )
fig.update_layout(title_x=0.5, 
                  coloraxis_colorbar=dict(title="<b>Vacancies</b>"), 
                  font=font_px)
fig.update_xaxes(tickangle = 270)
fig.write_image("Figures/Fig7_DemandPerCompanyAndLocationTop30_2.png", scale=2)
fig.write_image("Figures/Fig7_DemandPerCompanyAndLocationTop30.svg", scale=2)
fig.show(config=config)
AguascalientesBaja CaliforniaBaja California SurCampecheChiapasChihuahuaCiudad de MéxicoCoahuilaColimaDurangoEstado de MéxicoGuanajuatoGuerreroHidalgoJaliscoMichoacánMorelosNayaritNuevo LeónOaxacaPueblaQuerétaroQuintana RooRemote/NASan Luis PotosíSinaloaSonoraTabascoTamaulipasTlaxcalaVeracruzYucatánZacatecasSofttekSociedad Nacional Prom...Servicios Axity Mexico...SantanderReclutamiento En Tecno...Pepsico División FoodsNueva Wal Mart De Mexi...Michael PageManpower, S.A. De C.V.LiverpoolKairós Digital Solutio...Johnson ControlsIds ComercialHitssGrupo SalinasGrabjobsGonet, S.A. De C.V.EyEverisEnterprise Solutions, ...Consultoria Y Operacio...Cognizant Technology S...CitigroupCapgemini MéxicoBbva BancomerBanorteBancoppelBanamexBairesdev Llc / (Tax I...Adecco05101520VacanciesDemand Per Company (Top 30) & LocationLocationCompany

As expectable, most of the companies locate in Mexico City as the large majority of the vacancies are offered there. However, the heatmap shows that there are some organizations that are spread across several Mexican states such as Bairesdev or Pepsico.

Futhermore, there are few well-known companies whose data jobs demand is not located in the capital region, such as Jonhson Controls which is located in Nuevo León.

Then, the location of all companies demanding data jobs was explored with a second heatmap.

In [ ]:
# Heatmap of all companies demanding data jobs and their locations with Matplotlib
plt.figure(figsize = (12,9))
ax = sns.heatmap(all_companies_by_location_df, 
                cmap = "Blues_r",
                mask = (all_companies_by_location_df == 0))
ax.set_facecolor('xkcd:black')
cbar = ax.collections[0].colorbar
cbar.ax.tick_params(labelsize=12)
plt.xlabel('Companies Demanding Data Jobs')
plt.ylabel('Location')
plt.xticks([])
plt.title('Demand Per Company & Location\n')
plt.grid(False)
plt.savefig('Figures/Fig8_DemandPerCompanyAndLocation.png',  bbox_inches = 'tight')
plt.show()
2024-01-17T10:14:35.664939 image/svg+xml Matplotlib v3.7.4, https://matplotlib.org/
In [ ]:
# Converting all companies and their locations dataframe into tidy data
all_companies_by_location_tidy_df = (all_companies_by_location_df.reset_index()
                                      .rename(columns = {'index': 'Location'})
                                      .set_index('Location')
                                      .transpose().reset_index()
                                      .sort_values('Company', ascending = True)
                                    )
all_companies_by_location_tidy_df = pd.melt(all_companies_by_location_tidy_df, 
                                            id_vars = 'Company', var_name= 'Location', 
                                            value_name = 'Vacancies').sort_values(by='Location', ascending = False)
all_companies_by_location_tidy_df.head()
Out[ ]:
Company Location Vacancies
8414 Zillow Group Inc. Zacatecas 0.0
8287 Inacomb Zacatecas 0.0
8253 Financiera Sustentable... Zacatecas 0.0
8252 Financiera Independenc... Zacatecas 0.0
8251 Femsa Zacatecas 0.0
In [ ]:
# Heatmap of all companies demanding data jobs and their locations with Plotly
fig = px.density_heatmap(all_companies_by_location_tidy_df, 
                          y='Location', 
                          x='Company', 
                          z='Vacancies',
                          color_continuous_scale= heatmap_px_colorscale,
                          height=900,
                          width=1000, 
                          title= '<b>Demand Per Company & Location</b>',
                          labels={'Location': '<b>Location</b>',
                                  'Company': ''}, 
                         )
fig.update_layout(title_x=0.5, 
                  coloraxis_colorbar=dict(title="<b>Vacancies</b>"), 
                  font=font_px,
                  xaxis={'side': 'top'})
fig.update_xaxes(showticklabels=False)
fig.add_annotation(
    xref="paper", 
    yref="paper",
    x=0.5,
    y=-0.05,
    text='<b>Companies Demanding Data Jobs</b>',
    showarrow=False,
    font=dict(size=17, color ='dimgray')
)
fig.write_image("Figures/Fig8_DemandPerCompanyAndLocation2.png", scale=2)
fig.write_image("Figures/Fig8_DemandPerCompanyAndLocation2.svg", scale=2)
fig.show(config=config)
ZacatecasYucatánVeracruzTlaxcalaTamaulipasTabascoSonoraSinaloaSan Luis PotosíRemote/NAQuintana RooQuerétaroPueblaOaxacaNuevo LeónNayaritMorelosMichoacánJaliscoHidalgoGuerreroGuanajuatoEstado de MéxicoDurangoColimaCoahuilaCiudad de MéxicoChihuahuaChiapasCampecheBaja California SurBaja CaliforniaAguascalientes05101520VacanciesDemand Per Company & LocationLocationCompanies Demanding Data Jobs

Likewise, the above heatmap beautifully shows that most of the data jobs are concentrated in Ciudad de México and, to a lesser extent, in Nuevo León and Jalisco.

On the other hand, it is also noteworthy that remote positions are also becoming a trend among companies demanding datajobs.

5.7 How many salary observations are for each data job category? ¶

To answer this question, a bar plot was drawn. And, to draw such a figure, the corresponding dataset was prepared by using the methods groupby and count from Pandas.

It is also important to bear in mind that, from now on, the second dataset with only the vacancies with disclosed salary information, _salarydf, was used for analysis.

The bar plots were drawn using both Matplotlib and Plotly.

In [ ]:
# Count of observations per Data Job Category
obs_per_job_df = (salary_df.groupby('Job')[['Avg Salary']].count()
                  .rename(columns = {'Avg Salary': 'Observations'})
                  .sort_values('Observations', ascending = True)
                  )
obs_per_job_df
Out[ ]:
Observations
Job
ML Engineer 1
Data Architect 9
Data Scientist 10
Business Analyst 23
BI Analyst 26
Data Engineer 38
Data Analyst 90
In [ ]:
# Bar plot of the salary observations per data job category in Matplotlib

obs_per_job_plt = obs_per_job_df.plot.barh(color = sns.color_palette('Blues_r')[0], 
                                           figsize = (7, 6), 
                                           alpha = 0.6, 
                                           width = 0.9)
for container in obs_per_job_plt.containers:
    obs_per_job_plt.bar_label(container, fmt='{:,.0f}', fontweight='bold')
plt.legend([])
plt.xlabel('Observations with Disclosed Salary')
plt.ylabel('Data Job Category')
plt.title('Number of Salary Observations Per Data Job Category\n')
plt.savefig('Figures/Fig9_NumberOfSalaryObservationsPerDataJobCategory.png',  bbox_inches = 'tight')
plt.show()
2024-01-17T10:14:39.083552 image/svg+xml Matplotlib v3.7.4, https://matplotlib.org/
In [ ]:
# Bar plot of the salary observations per data job category in Plotly

# Defining bar colors
colors = [px.colors.sequential.Blues[4],]*8
colors[0] = px.colors.sequential.Blues[7]
colors[1] = px.colors.sequential.Blues[7]
colors[2] = px.colors.sequential.Blues[7]

# Bar plot
fig = px.bar(obs_per_job_df.reset_index(), 
             x='Observations', 
             y='Job',
             color = 'Observations',
             color_continuous_scale=px.colors.sequential.Blues,
             title= '<b>Number of Salary Observations Per Data Job Category</b>',
             labels={'Job': '<b>Data Job Category</b>',
                     'Observations': '<b>Observations with Disclosed Salary</b>'},
             height = 600,
             width = 800,
             text_auto=True   
             )
fig.update_layout(title_x=0.5, 
                  font=font_px)
fig.update_traces(marker_color=colors, 
                  marker_line_color='white',
                  marker_line_width=1, 
                  opacity=0.6)
fig.write_image("Figures/Fig9_NumberOfSalaryObservationsPerDataJobCategory2.png", scale=2)
fig.write_image("Figures/Fig9_NumberOfSalaryObservationsPerDataJobCategory2.svg", scale=2)
fig.show(config=config)
191023263890020406080ML EngineerData ArchitectData ScientistBusiness AnalystBI AnalystData EngineerData AnalystNumber of Salary Observations Per Data Job CategoryObservations with Disclosed SalaryData Job Category

In view of the plots above, it is necessary to be careful with the salary results for the Data Scientist, Data Architect, and ML Engineer positions, as less than 20 observations were collected. However, it is very positive to have collected salary observations for all data jobs categories.

5.8 Which data job category has the highest salary? ¶

To answer this question, bar charts and boxplots were drawn based on the salaries for each data job category.

For drawing the bar chart, a dataframe was built by using the groupby and mean methods from pandas.

On the contrary, for drawing the boxplots, the original dataframe salary_df was used in order to show the dispersion of the salary observations for each data job category.

Finally, the bar charts and boxplots were drawn using both Seaborn and Plotly.

In [ ]:
# Dataframe with the Avg Monthly Salary per Data Job Category
salary_per_job_df = (salary_df.groupby('Job')[['Avg Salary']].mean()
                     .reset_index()
                     .sort_values('Avg Salary', ascending = False)
                    )
salary_per_job_df
Out[ ]:
Job Avg Salary
6 ML Engineer 65000.000000
3 Data Architect 54222.222222
4 Data Engineer 45032.894737
5 Data Scientist 43150.050000
1 Business Analyst 31893.326087
0 BI Analyst 28632.711538
2 Data Analyst 24026.205556
In [ ]:
# Average Salary per Data Job Category with Matplotlib

palette = [sns.color_palette('Blues_r')[3],]*7
palette[0] = sns.color_palette('Blues_r')[0]
           
plt.figure(figsize = (7, 6))
ax = sns.barplot(data=salary_per_job_df, 
                 x='Avg Salary', 
                 y='Job',   
                 hue='Job',
                 palette=palette,
                 legend=False,
                 alpha=0.7, 
                 edgecolor='white')
ax.xaxis.set_major_formatter(mpl.ticker.StrMethodFormatter('${x:,.0f}'))
for container in ax.containers:
    ax.bar_label(container, fmt='${:,.0f}', fontweight='bold')
plt.legend([])
plt.xlabel('Average Monthly Salary (MXN)')
plt.ylabel('Data Job Category')
plt.grid({'both'})
plt.title('Average Salary Per Data Job Category\n')
plt.savefig('Figures/Fig10_AverageSalaryPerDataJobCategory.png',  bbox_inches = 'tight')
plt.show()
2024-01-17T10:14:40.837565 image/svg+xml Matplotlib v3.7.4, https://matplotlib.org/
In [ ]:
# Average Salary per Data Job Category with Plotly

colors = [px.colors.sequential.Blues[4],]*6
colors.append(px.colors.sequential.Blues[7])

fig = px.bar(salary_per_job_df.sort_values('Avg Salary', ascending=True), 
             x='Avg Salary', 
             y='Job',
             color='Avg Salary',             
             color_continuous_scale=px.colors.sequential.Blues,
             title='<b>Average Salary Per Data Job Category</b>',
             labels={'Job': '<b>Data Job Category</b>', 
                     'Avg Salary': '<b>Average Monthly Salary (MXN)</b>'},
             height=600,
             width=800,
             opacity=0.6,
             text_auto='$.3s'   
             )
fig.update_layout(#yaxis={'categoryorder':'array', 'categoryarray': order}, 
                  title_x=0.5, 
                  font=font_px)
fig.update_traces(marker_color=colors, 
                  marker_line_color='white', 
                  marker_line_width=1)
fig.update_xaxes(tickformat='$,~s')
fig.write_image("Figures/Fig10_AverageSalaryPerDataJobCategory2.png", scale=2)
fig.write_image("Figures/Fig10_AverageSalaryPerDataJobCategory2.svg", scale=2)
fig.show(config=config)
d:\DataJobsMX-Nov2023\.venv\lib\site-packages\numpy\core\numeric.py:2468: FutureWarning:

elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison

$24.0k$28.6k$31.9k$43.2k$45.0k$54.2k$65.0k$0$10k$20k$30k$40k$50k$60kData AnalystBI AnalystBusiness AnalystData ScientistData EngineerData ArchitectML EngineerAverage Salary Per Data Job CategoryAverage Monthly Salary (MXN)Data Job Category

From the bar charts above, it is clear that ML Engineer positions are the data job category with the highest monthly average salary with about $65,000 MXN.

Furthermore, the second place corresponds to the Data Architect positions with an average monthly salary of about $55,000 MXN.

The rest of the data job positions have the following approximate average monthly salaries:

  • Data Engineer: $45,000

  • Data Scientist: $43,000

  • Business Analyst: $32,000

  • BI Analyst: $29,000

  • Data Analyst: $24,000

However, the calculated monthly average salaries might be affected by outliers. So, the monthly median salaries were estimated with the help of the boxplots below.

In [ ]:
# Box plot for each Data Job Category with Matplotlib
plt.figure(figsize=(10,6))
ax = sns.boxplot(data=salary_df, 
                 x='Job', 
                 y='Avg Salary',                  
                 order = salary_per_job_df.Job,
                 #hue='Job',
                 palette = sns.color_palette("Blues_r", 7),
                 #legend=False
                 )
ax.yaxis.set_major_formatter(mpl.ticker.StrMethodFormatter('${x:,.0f}'))
plt.xlabel('Data Job Category')
plt.ylabel('Average Monthly Salary (MXN)')
plt.title('Salary Per Data Job Category\n')
plt.savefig('Figures/Fig11_SalaryPerDataJobCategory.png',  bbox_inches = 'tight')
plt.show()
C:\Users\dlope\AppData\Local\Temp\ipykernel_2476\2368826657.py:3: FutureWarning:



Passing `palette` without assigning `hue` is deprecated and will be removed in v0.14.0. Assign the `x` variable to `hue` and set `legend=False` for the same effect.


2024-01-17T10:14:42.391567 image/svg+xml Matplotlib v3.7.4, https://matplotlib.org/
In [ ]:
# Box plot for each Data Job Category with Plotly

fig = px.box(salary_df, 
             x="Job", 
             y="Avg Salary", 
             color="Job", 
             points="all", 
             color_discrete_sequence=px.colors.sequential.Blues_r,
             category_orders={"Job": salary_per_job_df.Job},
             labels={"Avg Salary": "<b>Average Monthly Salary (MXN)</b>",
                     "Job": "<b>Data Job Category</b>"},
             title='<b>Salary Per Data Job Category</b>',
             height=550,
             width=1000
             )
fig.update_layout(title_x=0.5, 
                  font=font_px)
fig.update_traces(showlegend=False)
fig.update_yaxes(tickformat = '$,~s')
fig.write_image("Figures/Fig11_SalaryPerDataJobCategory2.png", scale=2)
fig.write_image("Figures/Fig11_SalaryPerDataJobCategory2.svg", scale=2)
fig.show(config=config)
ML EngineerData ArchitectData EngineerData ScientistBusiness AnalystBI AnalystData Analyst$0$10k$20k$30k$40k$50k$60k$70k$80k$90kSalary Per Data Job CategoryData Job CategoryAverage Monthly Salary (MXN)
In [ ]:
# Calculating median average salaries by data job category

(salary_df.groupby('Job')[['Avg Salary']].median()
.sort_values('Avg Salary', ascending=False)
.rename(columns={"Avg Salary":"Median Salary"}))
Out[ ]:
Median Salary
Job
ML Engineer 65000.0
Data Architect 57500.0
Data Engineer 41500.0
Data Scientist 38250.0
Business Analyst 31000.0
BI Analyst 27500.0
Data Analyst 19000.0

The boxplot suggests that, after removing outliers from the calculation of the salaries distributions, ML Engineer positions are still the ones with the highest salaries in the current Mexican labor market with a monthly median salary of about 65,000 MXN. However, it is very important to bear in mind that only one salary observation has been collected for this data job category; so, this result should be taken with caution.

Then, the second monthly highest median salary correspond to Data Architect positions with about 57,500 MXN.

On the other hand, it was found that Data Engineer and Data Scientist positions have a median monthly salary of about 41,500 MXN and about 38,250 MXN, respectively.

Moreover, Business Analyst and BI Analyst positions have a median monthly salary of about 31,000 MXN and about 27,500 MXN, respectively.

Finally, Data Analyst positions are the ones with the lowest salaries in the Data Jobs labor market with only a median monthly salary of about 19,00 MXN.

Please refer to the section 6. Statistical Analysis below of the present notebook for a set of statistical tests applied to the salaries of each data job category.

In [ ]:
# Dataframe with the Mean and Median Monthly Salary per Data Job Category
avg_salary_per_job_df = (salary_df.groupby('Job')[['Avg Salary']].aggregate(['mean', 'median'])                     
                     .droplevel(level=0, axis=1)
                     .rename(columns={'mean':'Mean Salary', 'median':'Median Salary'}).
                     reset_index()                                       
                    )
avg_salary_per_job_df = pd.melt(avg_salary_per_job_df, id_vars='Job', var_name='Measure', value_name='Value')
avg_salary_per_job_df
Out[ ]:
Job Measure Value
0 BI Analyst Mean Salary 28632.711538
1 Business Analyst Mean Salary 31893.326087
2 Data Analyst Mean Salary 24026.205556
3 Data Architect Mean Salary 54222.222222
4 Data Engineer Mean Salary 45032.894737
5 Data Scientist Mean Salary 43150.050000
6 ML Engineer Mean Salary 65000.000000
7 BI Analyst Median Salary 27500.000000
8 Business Analyst Median Salary 31000.000000
9 Data Analyst Median Salary 19000.000000
10 Data Architect Median Salary 57500.000000
11 Data Engineer Median Salary 41500.000000
12 Data Scientist Median Salary 38250.000000
13 ML Engineer Median Salary 65000.000000
In [ ]:
# Mean and Median Salary per Data Job Category with Matplotlib

palette = [sns.color_palette('Blues_r')[0],sns.color_palette('Blues_r')[3]]
           
plt.figure(figsize = (7, 6))
ax = sns.barplot(data=avg_salary_per_job_df.sort_values('Value', ascending=False), 
                 x='Value', 
                 y='Job',   
                 hue='Measure',
                 palette=palette,
                 legend=True,
                 alpha=0.7, 
                 edgecolor='white')
ax.xaxis.set_major_formatter(mpl.ticker.StrMethodFormatter('${x:,.0f}'))
for container in ax.containers:
    ax.bar_label(container, fmt='${:,.0f}', fontweight='bold')
plt.legend(facecolor='white')
plt.xlabel('Monthly Salary (MXN)')
plt.ylabel('Data Job Category')
plt.grid({'both'})
plt.title('Mean & Median Salary Per Data Job Category\n')
plt.savefig('Figures/Fig12_MeanMedianSalaryPerDataJob.png',  bbox_inches = 'tight')
plt.show()
2024-01-17T10:14:44.511551 image/svg+xml Matplotlib v3.7.4, https://matplotlib.org/
In [ ]:
avg_salary_per_job_df.sort_values('Value', ascending=True).query("Measure == 'Mean Salary'")['Job']
Out[ ]:
2        Data Analyst
0          BI Analyst
1    Business Analyst
5      Data Scientist
4       Data Engineer
3      Data Architect
6         ML Engineer
Name: Job, dtype: object
In [ ]:
y = avg_salary_per_job_df.sort_values('Value', ascending=True).query("Measure == 'Mean Salary'")['Job']
x_mean = avg_salary_per_job_df.sort_values('Value', ascending=True).query("Measure == 'Mean Salary'")['Value']
x_median = avg_salary_per_job_df.sort_values('Value', ascending=True).query("Measure == 'Median Salary'")['Value']

opacity=0.7

fig = go.Figure()

fig.add_trace(go.Bar(y=y,
                x=x_mean,
                name='Mean Salary',
                marker_color=px.colors.sequential.Blues[7],
                opacity=opacity,    
                text=['$'+str(round(i/1000,1))+'k' for i in x_mean],                         
                orientation='h'           
                ))

fig.add_trace(go.Bar(y=y,
                x=x_median,
                name='Median Salary',
                marker_color=px.colors.sequential.Blues[4],
                opacity=opacity,
                text=['$'+str(round(i/1000,1))+'k' for i in x_median],                
                orientation='h'           
                ))

fig.update_layout(title='<b>Mean & Median Salary Per Data Job Category</b>',
                  title_x=0.5, 
                  yaxis=dict(title='<b>Data Job Category</b>'),
                  xaxis=dict(title='<b>Monthly Salary (MXN)</b>'),
                  height=600,
                  width=800,
                  barmode='group',
                  font=font_px,
                  legend=legend_px,
                  )
fig.update_traces(
                  marker_line_color='white', 
                  marker_line_width=1,                  
                  )
fig.update_xaxes(tickformat='$,~s')
fig.write_image("Figures/Fig12_MeanMedianSalaryPerDataJob2.png", scale=2)
fig.write_image("Figures/Fig12_MeanMedianSalaryPerDataJob2.svg", scale=2)
fig.show(config=config)
$24.0k$28.6k$31.9k$43.2k$45.0k$54.2k$65.0k$19.0k$27.5k$31.0k$38.2k$41.5k$57.5k$65.0k$0$10k$20k$30k$40k$50k$60kData AnalystBI AnalystBusiness AnalystData ScientistData EngineerData ArchitectML EngineerMean SalaryMedian SalaryMean & Median Salary Per Data Job CategoryMonthly Salary (MXN)Data Job Category

5.9 Where the highest salaries can be found? ¶

To answer this question, a heatmap was drawn to show the relationship among location, data job category and average salary. Likewise, the lighter the color, the higher the salary.

To do so, the appropriate dataframe was build using the methods pivot_table and join from Pandas. In this sense, the pivot salary table by location was joined with the states dataframe in order to display all the states within Mexico, and not only those with vacancies. Moreover, not disclosed locations were removed.

Finally, heatmaps were drawn using both Seaborn and Plotly.

In [ ]:
# Dataframe with averages salaries per location and data job category
salary_location_df = (pd.pivot_table(data = salary_df, 
                                     index = 'Location', 
                                     columns = 'Job', 
                                     values = 'Avg Salary', 
                                     aggfunc= 'mean')
                      .join(states_df, how = 'outer').drop(columns =['ID', 'Lat', 'Long'])
                      .fillna(0).sort_index(ascending= True)
                      )
salary_location_df.head()
Out[ ]:
BI Analyst Business Analyst Data Analyst Data Architect Data Engineer Data Scientist ML Engineer
Aguascalientes 0.0 0.0 18500.0 0.0 11000.0 0.0 0.0
Baja California 31000.0 0.0 0.0 0.0 0.0 0.0 0.0
Baja California Sur 0.0 0.0 10250.0 0.0 0.0 0.0 0.0
Campeche 0.0 0.0 0.0 0.0 0.0 0.0 0.0
Chiapas 0.0 0.0 0.0 0.0 0.0 0.0 0.0
In [ ]:
# Heatmap of the Salaries per Data Job category and Location with Matplotlib
plt.figure(figsize = (8,10))
formatter = mpl.ticker.StrMethodFormatter('${x:,.0f}')
ax = sns.heatmap(salary_location_df, 
            cmap = "Blues_r",
            mask = (salary_location_df == 0),
            cbar_kws={"format": formatter})
ax.set_facecolor('black')
cbar = ax.collections[0].colorbar
cbar.ax.tick_params(labelsize=11)
plt.xlabel('Data Job Category')
plt.ylabel('Location')
plt.title(f'Salary Per Location And Data Job Category\n')
plt.xticks(rotation = 330)
plt.grid(False)
plt.savefig('Figures/Fig13_SalaryPerLocationAndDataJobCategory.png',  bbox_inches = 'tight')
plt.show()
2024-01-17T10:14:46.865561 image/svg+xml Matplotlib v3.7.4, https://matplotlib.org/
In [ ]:
# Transforming dataframe into tidy format
salary_location_tidy_df = salary_location_df.rename_axis('Location').reset_index()
salary_location_tidy_df = pd.melt(salary_location_tidy_df, id_vars= 'Location', var_name = 'Job', value_name = 'Salary')
salary_location_tidy_df = salary_location_tidy_df.sort_values('Location', ascending = False)
salary_location_tidy_df.head()
Out[ ]:
Location Job Salary
230 Zacatecas ML Engineer 0.0
32 Zacatecas BI Analyst 0.0
197 Zacatecas Data Scientist 0.0
98 Zacatecas Data Analyst 0.0
131 Zacatecas Data Architect 0.0
In [ ]:
# Heatmap of the Salaries per Data Job category and Location with Plotly

fig = px.density_heatmap(salary_location_tidy_df, 
                         y='Location', 
                         x = 'Job', 
                         z = 'Salary',
                         histfunc="avg", 
                         color_continuous_scale=heatmap_px_colorscale,
                         #color_continuous_scale="Blues",
                         height=1000,
                         width=900,
                         title='<b>Salary Per Location And Data Job Category</b>',
                         labels={
                                  'Job': '<b>Data Job Category</b>',
                                  'Location':'<b>Location</b>'
                                  }
                          )
fig.update_layout(title_x=0.5, 
                  font=font_px, 
                  coloraxis_colorbar=dict(title="<b>Monthly Salary <br>(MXN)</b>"))
fig.update_traces(colorbar_title_text='Monthly Salary (MXN)')
fig.update_coloraxes(colorbar_tickformat = '$,~s')
fig.update_xaxes(categoryorder='array', categoryarray=heatmap_px_categoryarray)
fig.write_image("Figures/Fig13_SalaryPerLocationAndDataJobCategory2.png", scale=2)
fig.write_image("Figures/Fig13_SalaryPerLocationAndDataJobCategory2.svg", scale=2)
fig.show(config=config)
BI AnalystBusiness AnalystData AnalystData ArchitectData EngineerData ScientistML EngineerZacatecasYucatánVeracruzTlaxcalaTamaulipasTabascoSonoraSinaloaSan Luis PotosíRemote/NAQuintana RooQuerétaroPueblaOaxacaNuevo LeónNayaritMorelosMichoacánJaliscoHidalgoGuerreroGuanajuatoEstado de MéxicoDurangoColimaCoahuilaCiudad de MéxicoChihuahuaChiapasCampecheBaja California SurBaja CaliforniaAguascalientes$0$10k$20k$30k$40k$50k$60k$70kMonthly Salary (MXN)Salary Per Location And Data Job CategoryData Job CategoryLocation

In general, the above plot suggests that the highest salaries can be found in Remote, Mexico City, Jalisco, Estado de México and San Luis Potosí. However the observation for the latter state is atypical and should be interpreted with caution.

More specifically, the highest salaries for the different data jobs categories can be found in the following states:

Data Job Category Locations with the Highest Average Salaries
BI Analyst Remote
Business Analyst Jalisco
Data Analyst Remote
Data Architect Remote
Data Engineer Jalisco
Data Scientist Ciudad de México
ML Engineer Jalisco

5.10 What companies offer the highest salaries? ¶

To answer this question, a lollipop chart was drawn using the top average salaries per company. In this sense, the larger the lollilop, the higher the salary offered by a given company.

The dataset was built by using the grouby method from Pandas.

Finally, the charts were drawn using both Matplotlib and Plotly.

In [ ]:
# Dataframe with the top averages salaries per company
top_salary_company = 20
salary_company_df = (salary_df.groupby('Company')[['Avg Salary']].mean()
                    .drop('Confidential')
                    .sort_values('Avg Salary', ascending = False)[:top_salary_company]                     
                    )
salary_company_df.head()
Out[ ]:
Avg Salary
Company
Ecosistemex S. De R.L.... 90000.000000
Caspex Corp 77500.000000
Addon Technologies Inc. 77500.000000
Enterprise Solutions, ... 71666.666667
Softtek 70000.000000
In [ ]:
# Companies paying the highest salaries with Matplotlib
color = sns.color_palette('Blues_r')[0]
plt.figure(figsize = (8,7))
ax = sns.scatterplot(data = salary_company_df,
                    x = 'Avg Salary',
                    y = 'Company',
                    color = color
                    )
ax.xaxis.set_major_formatter(mpl.ticker.StrMethodFormatter('${x:,.0f}'))

for index, value in salary_company_df.sort_values('Avg Salary')['Avg Salary'].items():
    plt.hlines(xmax = value, xmin = np.min(salary_company_df['Avg Salary']), y = index, color=color)  # Stems

plt.xlabel('Average Monthly Salary (MXN)')
plt.ylabel(f'Company')
plt.xticks(rotation = 0)
plt.title(f'Top {top_salary_company} Companies Paying The Highest Salaries\n')
plt.savefig('Figures/Fig14_Top20SalaryPerCompany.png',  bbox_inches = 'tight')
plt.show()
2024-01-17T10:14:49.515599 image/svg+xml Matplotlib v3.7.4, https://matplotlib.org/
In [ ]:
# Companies paying the highest salaries with Plotly
color=px.colors.sequential.Blues[7]
fig = px.scatter(salary_company_df.reset_index().sort_values('Avg Salary'), 
                 x='Avg Salary', 
                 y='Company',
                 height = 700,
                 width = 800,
                 title = f'<b>Top {top_salary_company} Companies Paying The Highest Salaries</b>',
                 labels = {'Avg Salary':'<b>Average Monthly Salary (MXN)</b>',
                           'Company':'<b>Company</b>'},
                 )

fig.update_layout(title_x=0.5, font=font_px)
fig.update_traces(marker_color=color, marker_line_color=color)
fig.update_xaxes(tickformat = '$,~s')

for index, value in salary_company_df.sort_values('Avg Salary')['Avg Salary'].items():
  fig.add_shape(type='line', y0=index, y1= index, x0=np.min(salary_company_df['Avg Salary']), x1= value, xref='x',
                yref='y', line=dict(color= px.colors.sequential.Blues[7]))
  
fig.write_image("Figures/Fig14_Top20SalaryPerCompany2.png", scale=2)
fig.write_image("Figures/Fig14_Top20SalaryPerCompany2.svg", scale=2)
fig.show(config=config)
$50k$60k$70k$80k$90kTotaltechPayco Technology S De ...EverisRandstadHuman StaffSygnoGrupo Industrial Miner...Kode It S.C.AliaManpower, S.A. De C.V.Inetum Holding Mexico ...Cognizant Technology S...SantanderWork Cast Solutions Sa...Monterrey It ClusterSofttekEnterprise Solutions, ...Addon Technologies Inc.Caspex CorpEcosistemex S. De R.L....Top 20 Companies Paying The Highest SalariesAverage Monthly Salary (MXN)Company

From the plot aboves, the companies offering the highest salaries are Ecosistemex, Caspex, Addon Technologies, Enterprises Solutions, and Softtek. From those, 2 are recruiting agencies and 3 are tech consulting firms.

5.11 What companies offer the highest salaries per data job category? ¶

To answer this question, a heatmap was drawn to show the relationship among company, data job category and average salary. Likewise, the lighter the color, the higher the salary.

To do so, the appropriate dataframe was build using the methods pivot_table and join from Pandas. Moreover, not disclosed employers were removed.

Finally, heatmaps were drawn using both Seaborn and Plotly.

In [ ]:
# Dataframe with averages salaries per company and data job category
top_salary_company_per_data_job = 30
salary_company_per_data_job_df = (pd.pivot_table(data = salary_df, 
                                                 index = 'Company', 
                                                 columns = 'Job', 
                                                 values = 'Avg Salary', 
                                                 aggfunc= 'mean')                      
                                    .fillna(0).sort_index(ascending= True)
                                    .drop('Confidential')
                                    .assign(Average= lambda x: x[x > 0].mean(axis=1, numeric_only= True))
                                    .rename(columns={'Average':'Total Average'})
                                    .sort_values('Total Average', ascending = False)[:top_salary_company_per_data_job]
                                    .sort_index(ascending= True)
                                    )
salary_company_per_data_job_df.head()
Out[ ]:
Job BI Analyst Business Analyst Data Analyst Data Architect Data Engineer Data Scientist ML Engineer Total Average
Company
Addon Technologies Inc. 0.0 0.0 0.0 0.0 77500.0 0.0 0.0 77500.0
Ait Vanguardia Tecnoló... 0.0 0.0 0.0 0.0 43500.0 0.0 0.0 43500.0
Alia 0.0 0.0 62500.0 0.0 0.0 0.0 0.0 62500.0
Caspex Corp 0.0 0.0 77500.0 0.0 0.0 0.0 0.0 77500.0
Cognizant Technology S... 0.0 0.0 0.0 0.0 65000.0 0.0 0.0 65000.0
In [ ]:
# Heatmap of the Companies Offering the Highest Salaries Per Data Job Category with Matplolib
plt.figure(figsize = (8,9))
formatter = mpl.ticker.StrMethodFormatter('${x:,.0f}')
ax = sns.heatmap(data = salary_company_per_data_job_df.drop(columns = 'Total Average'),
                cmap = 'Blues_r',
                mask = (salary_company_per_data_job_df.drop(columns = 'Total Average') == 0),
                cbar_kws={"format": formatter}
                )
ax.set_facecolor('black')
cbar = ax.collections[0].colorbar
cbar.ax.tick_params(labelsize=11)
plt.xlabel('Data Job Category')
plt.ylabel('Company')
plt.title(f'Salary Per Company (Top {top_salary_company_per_data_job}) And Data Job Category\n')
plt.xticks(rotation = 330)
plt.grid(False)
plt.savefig('Figures/Fig15_SalaryPerCompanyAndDataJobCategory.png',  bbox_inches = 'tight')
plt.show()
2024-01-17T10:14:52.923600 image/svg+xml Matplotlib v3.7.4, https://matplotlib.org/
In [ ]:
# Converting dataframe into a tidy format
salary_company_per_data_job_tidy_df = (salary_company_per_data_job_df
                                       .copy()
                                       .drop(columns = 'Total Average')
                                       .reset_index()
                                        )
salary_company_per_data_job_tidy_df = pd.melt(salary_company_per_data_job_tidy_df,
                                            id_vars = 'Company', 
                                            var_name = 'Job', 
                                            value_name = 'Salary')
salary_company_per_data_job_tidy_df.head()
Out[ ]:
Company Job Salary
0 Addon Technologies Inc. BI Analyst 0.0
1 Ait Vanguardia Tecnoló... BI Analyst 0.0
2 Alia BI Analyst 0.0
3 Caspex Corp BI Analyst 0.0
4 Cognizant Technology S... BI Analyst 0.0
In [ ]:
# Heatmap of the Companies Offering the Highest Salaries Per Data Job Category with Plotly
fig = px.density_heatmap(salary_company_per_data_job_tidy_df.sort_values(by = 'Company', ascending = False), 
                          y='Company', 
                          x = 'Job', 
                          z = 'Salary',
                          histfunc="avg", 
                          color_continuous_scale=heatmap_px_colorscale,                          
                          height=1000,
                          width=900, 
                          title= f'<b>Salary Per Company (Top {top_salary_company_per_data_job}) And Data Job Category</b>',
                          labels={'Job': '<b>Data Job Category</b>',
                                  'Company':'<b>Company</b>'}
                          )
fig.update_layout(title_x=0.5, 
                  coloraxis_colorbar=dict(title="<b>Monthly Salary<br>(MXN)</b>"), 
                  font=font_px)
fig.update_xaxes(categoryorder='array', categoryarray= heatmap_px_categoryarray)
fig.update_coloraxes(colorbar_tickformat = '$,~s')
fig.write_image("Figures/Fig15_SalaryPerCompanyAndDataJobCategory2.png", scale=2)
fig.write_image("Figures/Fig15_SalaryPerCompanyAndDataJobCategory2.svg", scale=2)
fig.show(config=config)
BI AnalystBusiness AnalystData AnalystData ArchitectData EngineerData ScientistML EngineerZegovia Rh Sa De CvWork Cast Solutions Sa...Wipro Technologies, S....Uniteller BanorteTotaltechThtec Talento Humano E...SygnoSofttekServicios Axity Mexico...SantanderRandstadPayco Technology S De ...Monterrey It ClusterManpower, S.A. De C.V.Kode It S.C.Johnson ControlsInetum Holding Mexico ...Ids ComercialHuman StaffGrupo Industrial Miner...Grupo AlseaEverisEnterprise Solutions, ...Ecosistemex S. De R.L....ConsissCognizant Technology S...Caspex CorpAliaAit Vanguardia Tecnoló...Addon Technologies Inc.$0$10k$20k$30k$40k$50k$60k$70k$80k$90kMonthly Salary(MXN)Salary Per Company (Top 30) And Data Job CategoryData Job CategoryCompany

So, for BI Analyst positions, the company offering higher salaries are Randstad and Ids Comercial.

For Business Analyst positions, the organizations offering higher salaries are Manpower and Totaltech.

Moreover, for Data Analyst positions, the organizations offering higher salaries are Caspex and Santander.

Furthermore, for Data Architect positions, the organizations offering higher salaries are Softtek and Everis.

In addition, for Data Engineer positions, the organizations offering higher salaries are Manpower, and Addon Technologies.

For Data Scientist positions, the organizations offering higher salaries are Ecosistemex and Enterprises Solutions.

Finally, for ML Engineer positions, the only company with a disclosed salary is Enterprise Solutions.

5.12 What vacancies offer the highest salaries? ¶

To answer this question, a bar chart was drawn to show the vacancies with a salary in the 85th percentile. The original job titles as published in the job website were used for sake of descriptiveness.

To do so, the percentiles were calculated with the percentile function from Numpy. Then the salary dataframe was filtered to select only those vacancies equal or above the 85th percentile.

Finally, a bar chart was drawn using both Seaborn and Plotly.

In [ ]:
# Quantiles calculatuib
percentile = np.percentile(salary_df['Avg Salary'].values, 85)
percentile
Out[ ]:
57500.0
In [ ]:
# Dataframe with vacancies whose salary is within the last decile
top_vacancies = (salary_df[salary_df['Avg Salary'] >= percentile]
                 .drop(columns=['Min Salary', 'Max Salary'])
                 .sort_values('Avg Salary', ascending=False)
                 .reset_index(drop=True)
                 .assign(Salary_k= lambda d: d['Avg Salary']/1000)
                )
top_vacancies
Out[ ]:
Original Job Title Job Company Location Avg Salary Salary_k
0 Científico de Datos Data Scientist Ecosistemex S. De R.L.... Ciudad de México 90000.0 90.0
1 Data Engineer (Databricks) Data Engineer Manpower, S.A. De C.V. Estado de México 85000.0 85.0
2 Data engineer (Databricks) Data Engineer Manpower, S.A. De C.V. Estado de México 85000.0 85.0
3 Analista de datos Sr. Data Analyst Confidential Ciudad de México 80000.0 80.0
4 Sr Data Engineer Data Engineer Addon Technologies Inc. Jalisco 77500.0 77.5
5 SQL Server Data Analyst Data Analyst Caspex Corp Remote/NA 77500.0 77.5
6 Data Engineer with Fivetran developer Data Engineer Enterprise Solutions, ... Remote/NA 75000.0 75.0
7 Data Architect Data Architect Softtek Remote/NA 75000.0 75.0
8 Sr Data Scientist Data Scientist Enterprise Solutions, ... Ciudad de México 75000.0 75.0
9 Analista de datos sr. Data Analyst Confidential Ciudad de México 75000.0 75.0
10 Business Intelligence Program Manager, Advance... BI Analyst Randstad Ciudad de México 70000.0 70.0
11 Arquitecto de Datos Azure Data Architect Monterrey It Cluster Ciudad de México 67500.0 67.5
12 Sr. Business Systems Analyst Business Analyst Confidential Jalisco 67500.0 67.5
13 Arquitecto cloud Datos Data Architect Everis Remote/NA 67500.0 67.5
14 Senior Data Engineer (Qlik) Data Engineer Confidential Jalisco 67500.0 67.5
15 Arquitecto de Datos Azure Data Architect Monterrey It Cluster Ciudad de México 65000.0 65.0
16 Data engineer/SCALA Data Engineer Cognizant Technology S... Jalisco 65000.0 65.0
17 Sr. Data Engineer Data Engineer Inetum Holding Mexico ... Remote/NA 65000.0 65.0
18 Azure Machine Learning – Technical Support Eng... ML Engineer Enterprise Solutions, ... Jalisco 65000.0 65.0
19 Data Engineer Azure Data Engineer Softtek Remote/NA 65000.0 65.0
20 Data Engineer Data Engineer Work Cast Solutions Sa... Remote/NA 65000.0 65.0
21 Data Analyst Data Analyst Santander Ciudad de México 65000.0 65.0
22 Ingeniero de datos / Desarrollador / Hibrido Data Engineer Ids Comercial Ciudad de México 64000.0 64.0
23 Data Analyst Data Analyst Alia Ciudad de México 62500.0 62.5
24 Oracle Business Analyst I Business Analyst Manpower, S.A. De C.V. San Luis Potosí 62500.0 62.5
25 Data Integration Engineer (GCP) Data Engineer Kode It S.C. Ciudad de México 60500.0 60.5
26 INGENIERO SR BASE DE DATOS Data Engineer Grupo Industrial Miner... Ciudad de México 60000.0 60.0
27 Big Data Engineer Sr - Inglés Data Engineer Servicios Axity Mexico... Ciudad de México 57500.0 57.5
28 Arquitecto de Gobierno de datos y empresarial Data Architect Confidential Remote/NA 57500.0 57.5
29 Gerente Analista de Datos / Power BI Data Analyst Human Staff Nuevo León 57500.0 57.5
30 Analista de Datos en Azure (6 meses) Data Analyst Sygno Nuevo León 57500.0 57.5
In [ ]:
# Bar chart of the vacancies with the highest salaries in Matplotlib

fig, ax = plt.subplots(figsize = (5,9))
sns.barplot(data=top_vacancies, 
            x='Salary_k',
            y=top_vacancies.index,                 
            hue='Location',
            palette='viridis',
            legend=True,
            alpha=0.7, 
            dodge=False,
            edgecolor='white',
            errorbar=None,
            ax = ax,
            orient='h')
ax.xaxis.set_major_formatter(mpl.ticker.StrMethodFormatter('${x:,.0f}k'))
ax.set_yticklabels(top_vacancies['Original Job Title'])
ax.set_xlim([min(top_vacancies['Salary_k'])-5, max(top_vacancies['Salary_k'])])

ax2 = ax.secondary_yaxis('right')
ytickslocs = ax.get_yticks()
ax2.set_yticks(ytickslocs, labels=top_vacancies.Company)
ax2.set_ylabel('\nCompany')

plt.legend(facecolor='white')
plt.xlabel('\nAverage Monthly Salary (MXN)')
plt.ylabel('Vacancy')
plt.grid({'both'})
plt.title('Vacancies with the Highest Salaries\n')
plt.savefig('Figures/Fig16_VacanciesHighestSalaries.png',  bbox_inches = 'tight')
plt.show()
C:\Users\dlope\AppData\Local\Temp\ipykernel_2476\4261754965.py:17: UserWarning:

FixedFormatter should only be used together with FixedLocator

2024-01-17T10:14:58.820601 image/svg+xml Matplotlib v3.7.4, https://matplotlib.org/
In [ ]:
# Bar chart of the vacancies with the highest salaries in Plotly

top_vacancies_plotly = top_vacancies.sort_values('Avg Salary', ascending=True).reset_index(drop=True)

subfig = make_subplots(specs=[[{"secondary_y": True}]],
                       
                       )

fig = px.bar(top_vacancies_plotly, 
             x='Avg Salary', 
             y=top_vacancies_plotly.index,
             color='Location',
             color_discrete_sequence=px.colors.sequential.Viridis,
             orientation='h',
             )


fig2 = px.line(top_vacancies_plotly, 
             x=top_vacancies_plotly.index, 
             y=top_vacancies_plotly.index,     
             orientation='h'   
             )

fig2.update_traces(yaxis="y2")

subfig.add_traces(fig.data + fig2.data)

subfig.update_xaxes(tickformat = '$,~s', 
                    range=[min(top_vacancies_plotly['Avg Salary'])-10000, max(top_vacancies_plotly['Avg Salary'])])

subfig.update_layout(
                  title='<b>Vacancies with the Highest Salaries</b>', 
                  xaxis=dict(title='<b>Average Monthly Salary (MXN)</b>'),
                  title_x=0.5, 
                  font=font_px,
                  legend=dict(
                                bgcolor='rgba(255,255,255,0.6)',
                                bordercolor='#cbcccd',
                                borderwidth=1.5,
                                yanchor="bottom",
                                y=0.04,
                                xanchor="right",
                                x=0.90,         
                                itemwidth=30,                                 
                        ),
                  yaxis = dict(
                                title='<b>Vacancy</b>',
                                tickvals=top_vacancies_plotly.index,
                                ticktext=top_vacancies_plotly['Original Job Title'],
                                showgrid=True
                  ),
                  yaxis2 = dict(
                                title='<b>Company</b>',
                                tickvals=top_vacancies_plotly.index,
                                ticktext=top_vacancies_plotly['Company'],
                                showgrid=False,
                                scaleanchor="y",
                                scaleratio=1,
        
                  ),
                  height=900, 
                  width=1000,
                  barmode='stack', 
                  bargap=0.15,
                  )

subfig.update_traces( 
                        marker_line_color='white',
                        marker_line_width=1, 
                        opacity=0.7,
                        )
subfig.write_image("Figures/Fig16_VacanciesHighestSalaries2.png", scale=2)
subfig.write_image("Figures/Fig16_VacanciesHighestSalaries2.svg", scale=2)
subfig.show(config=config)
$50k$60k$70k$80k$90kAnalista de Datos en Azure (6 meses)Arquitecto de Gobierno de datos y empresarialBig Data Engineer Sr - InglésGerente Analista de Datos / Power BIINGENIERO SR BASE DE DATOSData Integration Engineer (GCP)Oracle Business Analyst IData AnalystIngeniero de datos / Desarrollador / HibridoData AnalystData EngineerData Engineer AzureAzure Machine Learning – Technical Support EngineerSr. Data EngineerData engineer/SCALAArquitecto de Datos AzureArquitecto cloud DatosSr. Business Systems AnalystArquitecto de Datos AzureSenior Data Engineer (Qlik)Business Intelligence Program Manager, Advanced EnglishAnalista de datos sr.Sr Data ScientistData ArchitectData Engineer with Fivetran developerSQL Server Data AnalystSr Data EngineerAnalista de datos Sr.Data engineer (Databricks)Data Engineer (Databricks)Científico de DatosSygnoConfidentialServicios Axity Mexico...Human StaffGrupo Industrial Miner...Kode It S.C.Manpower, S.A. De C.V.AliaIds ComercialSantanderWork Cast Solutions Sa...SofttekEnterprise Solutions, ...Inetum Holding Mexico ...Cognizant Technology S...Monterrey It ClusterEverisConfidentialMonterrey It ClusterConfidentialRandstadConfidentialEnterprise Solutions, ...SofttekEnterprise Solutions, ...Caspex CorpAddon Technologies Inc.ConfidentialManpower, S.A. De C.V.Manpower, S.A. De C.V.Ecosistemex S. De R.L....Nuevo LeónRemote/NACiudad de MéxicoSan Luis PotosíJaliscoEstado de MéxicoVacancies with the Highest SalariesAverage Monthly Salary (MXN)VacancyCompany

As can be seen from the plots above, the vacancies within the 85th percentile have an average monthly compensation ranging from $57,500 MXN to $90,000 MXN.

Indeed, the top-paying vacancy at the time of this study was a Data Scientist position at Ecosistemex, located in Mexico City.

As expectable, most of those vacancies correspond to senior or managerial positions. However, it is notewhorthy that most data job categories are represented in the 85th percentile with several senior Data Analyst, Data Engineer, Data Scientists, and BI positions within the top vacancies.

It is remarkable that the vacancies with the highest salaries are located in Mexico City, Estado de México, Jalisco, San Luis Potorí, Nuevo León, and in remote.


6. Statistical Analysis ¶


In this section, an statistical analysis was performed to conclude whether there was any statistical difference among the salary means for each data job category, as well as identifying in which categories the difference was significant as well.

To do so, firstly, the D'Agostino-Pearson normality test was used to assess the normality of the data jobs salary distribution. Then, regardless of results of the normality test, a set of both parametric and non-parametric tests were carried out.

In this sense, the salary means among the data job categories were compared through an ANOVA and a Kruskal-Wallis test.

After that, a post hoc analysis was performed by means of the Tukey-Kramer and Dunn's tests in order to identify which salary differences were, indeed, significant.

Moreover, to confirm the significance of the difference in the salary means for specific pairs of data jobs, the one-sample t-test, two-sample t-test with unequal variance and the Mann-Whitney U test were also performed.

Finally, an effect size analysis was also carried out by computing the absolute mean salary difference, the Cohen's d, and the bootstrap confidence intervals for each data job category. This, in order to assess whether the salary differences are significant not only from a statistical but from a practical point of view.

6.1 How is the salary distribution of the data jobs? ¶

To answer this question, a histogram was drawn with the retrieved salary data using Seaborn and Plotly.

In [ ]:
# Histogram with Seaborn

bins = 8

minh = np.min(salary_df['Avg Salary'])
maxh = np.max(salary_df['Avg Salary'])

ticks = np.linspace(minh,maxh,bins + 1)

plt.figure(figsize = (10, 6))
ax = sns.histplot(data = salary_df, 
                  x = 'Avg Salary', 
                  color = sns.color_palette('Blues_r')[0],
                  bins = bins,
                  alpha = 0.6
                )
ax.xaxis.set_major_formatter(mpl.ticker.StrMethodFormatter('${x:,.0f}'))
plt.xlabel('Average Monthly Salary (MXN)')
plt.ylabel('Frequency')
plt.xticks(ticks)
plt.title('Data Jobs Salary Distribution\n')
plt.grid({'both'})
plt.savefig('Figures/Fig17_DataJobsSalaryDistribution.png',  bbox_inches = 'tight')
plt.show()
2024-01-15T22:13:03.676173 image/svg+xml Matplotlib v3.7.4, https://matplotlib.org/
In [ ]:
# Histogram with Plotly

size = ( maxh - minh )/bins

fig = px.histogram(salary_df, x = 'Avg Salary', 
                   nbins= bins, 
                   opacity=0.5, 
                   color_discrete_sequence = [px.colors.sequential.Blues_r[1]],
                   title= '<b>Data Jobs Salary Distribution</b>',
                   labels={'Avg Salary': '<b>Average Monthly Salary (MXN)</b>'},
                   height = 600,
                   width = 800                   
                   )
fig.update_layout(yaxis_title_text='<b>Frequency</b>', bargap=0.01, title_x=0.5, 
                  font=font_px)
fig.update_traces(xbins=dict( # bins used for histogram
                  start=minh,
                  end=maxh,
                  size=size
                  ))
fig.update_xaxes(tickvals=list(ticks), tickformat = '$,~s')
fig.write_image("Figures/Fig17_DataJobsSalaryDistribution2.png", scale=2)
fig.write_image("Figures/Fig17_DataJobsSalaryDistribution2.svg", scale=2)
fig.show(config=config)
$4k$14.75k$25.5k$36.25k$47k$57.75k$68.5k$79.25k$90k0102030405060Data Jobs Salary DistributionAverage Monthly Salary (MXN)Frequency

Most of the Data Jobs salaries are located in the range from $14,750 MXN to $25,500 MXN per month, whereas monthly salaries superior to $68,500 MXN are more scarce.

On the other hand, it is noteworthy that the salary distributions is skewed to the right. So, it might not comply with the normality assumption.

6.2 Are the salary observations following a normal distribution? ¶

To answer this question, a D’Agostino-Pearson test of normal distribution was carried out using the library Scipy.

$$\mathbf{H_0}: Normal \; Distribution$$

$$\mathbf{H_1}: Not \; H_0$$$$\alpha = 0.05$$
In [ ]:
# D’Agostino-Pearson test of normal distribution
stat, pvalue = stats.normaltest(salary_df['Avg Salary'])
print(f'The K2 statistic is: {stat:.03f}. \nThe p-value is: {pvalue:.03f}.')
The K2 statistic is: 24.808. 
The p-value is: 0.000.

Indeed, as the p-value is significant ($p < 0.05$), the null hypothesis that the sample comes from a normal distribution is rejected.

6.3 How are the salary distributions by data job category? ¶

To answer this question, multiple histograms were drawn with the salary data for each data job category using Matplotlib and Plotly.

In [ ]:
# Multiple histograms with Matplolib
color = sns.color_palette("Blues_r")[0]
edgecolor = 'white'
alpha = 0.5
figure, axis = plt.subplots(nrows=4, ncols=2, figsize = (13,18))
axis[0, 0].hist(salary_df.loc[salary_df['Job'] == 'BI Analyst']['Avg Salary'], color = color, edgecolor = edgecolor, alpha = alpha)
axis[0, 0].set_title("BI Analyst")
axis[0, 0].xaxis.set_major_formatter(mpl.ticker.StrMethodFormatter('${x:,.0f}'))
axis[0, 1].hist(salary_df.loc[salary_df['Job'] == 'Business Analyst']['Avg Salary'], color = color, edgecolor = edgecolor, alpha = alpha)
axis[0, 1].set_title("Business Analyst")
axis[0, 1].xaxis.set_major_formatter(mpl.ticker.StrMethodFormatter('${x:,.0f}'))
axis[1, 0].hist(salary_df.loc[salary_df['Job'] == 'Data Analyst']['Avg Salary'], color = color, edgecolor = edgecolor, alpha = alpha)
axis[1, 0].set_title("Data Analyst")
axis[1, 1].hist(salary_df.loc[salary_df['Job'] == 'Data Architect']['Avg Salary'], color = color, edgecolor = edgecolor, alpha = alpha)
axis[1, 1].set_title("Data Architect")
axis[2, 0].hist(salary_df.loc[salary_df['Job'] == 'Data Engineer']['Avg Salary'], color = color, edgecolor = edgecolor, alpha = alpha)
axis[2, 0].set_title("Data Engineer")
axis[2, 1].hist(salary_df.loc[salary_df['Job'] == 'Data Scientist']['Avg Salary'], color = color, edgecolor = edgecolor, alpha = alpha)
axis[2, 1].set_title("Data Scientist")
axis[3, 0].hist(salary_df.loc[salary_df['Job'] == 'ML Engineer']['Avg Salary'], color = color, edgecolor = edgecolor, alpha = alpha)
axis[3, 0].set_title("ML Engineer")
axis[3, 0].xaxis.set_major_formatter(mpl.ticker.StrMethodFormatter('${x:,.0f}'))
plt.savefig('Figures/Fig18_SalaryDistributionsPerDataJobCategory.png',  bbox_inches = 'tight')
plt.show()
2024-01-15T22:13:12.799174 image/svg+xml Matplotlib v3.7.4, https://matplotlib.org/
In [ ]:
# Multiple histograms with Plotly
bins = 10
fig = make_subplots(
    rows=4, cols=2,
    subplot_titles=("<b>BI Analyst</b>", "<b>Data Analyst</b>", "<b>Business Analyst</b>", "<b>Data Architect</b>", "<b>Data Engineer</b>", "<b>Data Scientist</b>", "<b>ML Engineer</b>"),
    specs=[[{"colspan": 2}, None],
           [{}, {}],                 
           [{}, {}],
           [{}, {}]]
    )

fig.add_trace(go.Histogram(x=salary_df.loc[salary_df['Job'] == 'BI Analyst']['Avg Salary'], nbinsx= bins),
              row=1, col=1)

fig.add_trace(go.Histogram(x=salary_df.loc[salary_df['Job'] == 'Data Analyst']['Avg Salary'], nbinsx= bins),
              row=2, col=1)

fig.add_trace(go.Histogram(x=salary_df.loc[salary_df['Job'] == 'Business Analyst']['Avg Salary'], nbinsx= bins),
              row=2, col=2)

fig.add_trace(go.Histogram(x=salary_df.loc[salary_df['Job'] == 'Data Architect']['Avg Salary'], nbinsx= bins),
              row=3, col=1)

fig.add_trace(go.Histogram(x=salary_df.loc[salary_df['Job'] == 'Data Engineer']['Avg Salary'], nbinsx= bins),
              row=3, col=2)

fig.add_trace(go.Histogram(x=salary_df.loc[salary_df['Job'] == 'Data Scientist']['Avg Salary'], nbinsx= bins),
              row=4, col=1)

fig.add_trace(go.Histogram(x=salary_df.loc[salary_df['Job'] == 'ML Engineer']['Avg Salary'], nbinsx= bins),
              row=4, col=2)

fig.update_layout(height=1300, 
                  width=800, 
                  font=font_px,
                  title_text="<b>Salary Distributions Per Data Job Category</b>",
                  title_x=0.5,
                  bargap=0.02 
                  )
fig.update_traces(marker_color=px.colors.sequential.Blues[7], 
                  marker_line_color='white', 
                  marker_line_width=1, 
                  opacity = 0.5, 
                  showlegend=False)
fig.update_xaxes(tickformat = '$,~s')
fig.update_xaxes(title_font=dict(size=14), title_text="Average Monthly Salary (MXN)", row=1, col=1)
fig.update_xaxes(title_font=dict(size=14), title_text="Average Monthly Salary (MXN)", row=2, col=1)
fig.update_xaxes(title_font=dict(size=14), title_text="Average Monthly Salary (MXN)", row=2, col=2)
fig.update_xaxes(title_font=dict(size=14), title_text="Average Monthly Salary (MXN)", row=3, col=1)
fig.update_xaxes(title_font=dict(size=14), title_text="Average Monthly Salary (MXN)", row=3, col=2)
fig.update_xaxes(title_font=dict(size=14), title_text="Average Monthly Salary (MXN)", row=4, col=1)
fig.update_xaxes(title_font=dict(size=14), title_text="Average Monthly Salary (MXN)", row=4, col=2)
fig.update_yaxes(title_font=dict(size=14), title_text="Frequency", row=1, col=1)
fig.update_yaxes(title_font=dict(size=14), title_text="Frequency", row=2, col=1)
fig.update_yaxes(title_font=dict(size=14), title_text="Frequency", row=2, col=2)
fig.update_yaxes(title_font=dict(size=14), title_text="Frequency", row=3, col=1)
fig.update_yaxes(title_font=dict(size=14), title_text="Frequency", row=3, col=2)
fig.update_yaxes(title_font=dict(size=14), title_text="Frequency", row=4, col=1)
fig.update_yaxes(title_font=dict(size=14), title_text="Frequency", row=4, col=2)
fig.write_image("Figures/Fig18_SalaryDistributionsPerDataJobCategory2.png", scale=2)
fig.write_image("Figures/Fig18_SalaryDistributionsPerDataJobCategory2.svg", scale=2)
fig.show(config=config)
$0$10k$20k$30k$40k$50k$60k$70k$80k02468$0$20k$40k$60k$80k0102030$0$20k$40k$60k0246$20k$40k$60k$80k0123$20k$40k$60k$80k02468$20k$40k$60k$80k$100k01234$64.9995k$65k$65.0005k00.20.40.60.81Salary Distributions Per Data Job CategoryAverage Monthly Salary (MXN)Average Monthly Salary (MXN)Average Monthly Salary (MXN)Average Monthly Salary (MXN)Average Monthly Salary (MXN)Average Monthly Salary (MXN)Average Monthly Salary (MXN)FrequencyFrequencyFrequencyFrequencyFrequencyFrequencyFrequencyBI AnalystData AnalystBusiness AnalystData ArchitectData EngineerData ScientistML Engineer

Indeed, only the salary observations for the BI Analyst, Business Analyst, and Data Engineer positions somewhat resemble a normal distribution, whereas it is not possible to state the same with the observations for the other data job categories. Thus a normal assumption cannot be hold.

Nonetheless, for the purposes of the present study, both parametric (ANOVA, Tukey-Kramer and T-test with unequal variance) and non-parametric (Kruskal-Wallis H, Dunn and Mann-Whitney U) tests were carried out to assess the significance of the obtained results.

6.4 Are the salary differences among Data Jobs statistically significant? ¶

To answer this question, an ANOVA and a Kruskal-Wallis H test were performed on the salary data using Scipy and Statsmodels.

$$\mathbf{H_0}: \mu_{BI} = \mu_{BA} = \mu_{DA} = \mu_{DR} = \mu_{DE} = \mu_{DS} = \mu_{ML} $$$$\mathbf{H_1}: Not \; H_0$$$$\alpha = 0.05$$

First, the salary observations for each data job category were prepared from the dataset using the pivot_table method from Pandas.

In [ ]:
# Salary pivot table by data job category
salary_pivot_df = pd.pivot_table(salary_df, index = 'Company', columns = 'Job', values = 'Avg Salary', aggfunc= 'mean')
salary_pivot_df.head()
Out[ ]:
Job BI Analyst Business Analyst Data Analyst Data Architect Data Engineer Data Scientist ML Engineer
Company
5-18 Consulting Group ... NaN NaN 30500.000000 NaN NaN NaN NaN
Addon Technologies Inc. NaN NaN NaN NaN 77500.0 NaN NaN
Adecco 55000.0 NaN 18333.166667 NaN 27500.0 NaN NaN
Ait Vanguardia Tecnoló... NaN NaN NaN NaN 43500.0 NaN NaN
Alia NaN NaN 62500.000000 NaN NaN NaN NaN
In [ ]:
salary_pivot_df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 120 entries, 5-18 Consulting Group ... to Zegovia Rh Sa De Cv
Data columns (total 7 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   BI Analyst        20 non-null     float64
 1   Business Analyst  23 non-null     float64
 2   Data Analyst      64 non-null     float64
 3   Data Architect    6 non-null      float64
 4   Data Engineer     30 non-null     float64
 5   Data Scientist    8 non-null      float64
 6   ML Engineer       1 non-null      float64
dtypes: float64(7)
memory usage: 7.5+ KB

It is important to bear in mind that only one salary observations was retrieved for a ML Engineer position.

Then, a Numpy array was defined for each data job category and the NaN values were dropped.

In [ ]:
# Array of salary observations by data job category
BI = salary_pivot_df['BI Analyst'].dropna().values
BA = salary_pivot_df['Business Analyst'].dropna().values
DA = salary_pivot_df['Data Analyst'].dropna().values
DR = salary_pivot_df['Data Architect'].dropna().values
DE = salary_pivot_df['Data Engineer'].dropna().values
DS = salary_pivot_df['Data Scientist'].dropna().values
ML = salary_pivot_df['ML Engineer'].dropna().values

Then, the parametric ANOVA test was carried out:

In [ ]:
# ANOVA with outliers
stat, pvalue = stats.f_oneway(BI, BA, DA, DR, DE, DS, ML)

print(f'The F statistic is: {stat:.03f}. \nThe p-value is: {pvalue:.03f}.\n')

alpha = 0.05
if pvalue > alpha:
	print('Test interpretation: Fail to reject H0.')
else:
	print('Test interpretation: Reject H0.')
The F statistic is: 8.657. 
The p-value is: 0.000.

Test interpretation: Reject H0.
In [ ]:
# ANOVA using the Statsmodels Formula API
model = smf.ols(formula = "Salary ~ C(Job)", data = salary_df.rename(columns = {'Avg Salary': 'Salary'})).fit()
anova_table = sm.stats.anova_lm(model, typ=2)
anova_table
Out[ ]:
sum_sq df F PR(>F)
C(Job) 1.924338e+10 6.0 11.149694 1.200306e-10
Residual 5.465386e+10 190.0 NaN NaN

Moreover, the non-parametric Kruskal-Wallis H test was also performed:

In [ ]:
# Kruskal-Wallis H Test
stat, pvalue = stats.kruskal(BI, BA, DA, DR, DE, DS, ML)

print(f'The H statistic is: {stat:.03f}. \nThe p-value is: {pvalue:.03f}.\n')

alpha = 0.05
if pvalue > alpha:
	print('Test interpretation: Fail to reject H0.')
else:
	print('Test interpretation: Reject H0.')
The H statistic is: 42.662. 
The p-value is: 0.000.

Test interpretation: Reject H0.

The obtained p-values from both tests are significant ($p < 0.05$). Therefore, the salary differences among the data job categories are statistically significant.

Therefore, the next step is to identify which salary differences among the data jobs are significant.

6.5 Which salary differences among Data Jobs are statistically significant? ¶

To answer this question, the Tukey-Kramer and Dunn's post hoc tests were performed on the salary data using Scikit_posthocs.

So, firstly, the data was tranformed into a long format in order to be able to the processed by the library.

In [ ]:
# Preparing dataframe in a tidy format
salary_tidy_df = pd.melt(salary_pivot_df, var_name='Job', value_name='Salary').dropna().reset_index(drop=True)
salary_tidy_df.head()
Out[ ]:
Job Salary
0 BI Analyst 55000.0
1 BI Analyst 27500.0
2 BI Analyst 17800.0
3 BI Analyst 27500.0
4 BI Analyst 17500.0

Then, the post hoc tests were carried out.

In [ ]:
# Tukey-Kramer Test
tukey_df = posthoc_tukey(salary_tidy_df, val_col='Salary', group_col='Job')
tukey_df
Out[ ]:
BI Analyst Business Analyst Data Analyst Data Architect Data Engineer Data Scientist ML Engineer
BI Analyst 1.000000 0.900000 0.824031 0.030082 0.027678 0.374142 0.392984
Business Analyst 0.900000 1.000000 0.505134 0.053670 0.062074 0.524874 0.460989
Data Analyst 0.824031 0.505134 1.000000 0.001000 0.001000 0.026352 0.200257
Data Architect 0.030082 0.053670 0.001000 1.000000 0.871216 0.900000 0.900000
Data Engineer 0.027678 0.062074 0.001000 0.871216 1.000000 0.900000 0.900000
Data Scientist 0.374142 0.524874 0.026352 0.900000 0.900000 1.000000 0.900000
ML Engineer 0.392984 0.460989 0.200257 0.900000 0.900000 0.900000 1.000000
In [ ]:
# Dunn's Test
dunn_df = posthoc_dunn(salary_tidy_df, val_col='Salary', group_col='Job')
dunn_df
Out[ ]:
BI Analyst Business Analyst Data Analyst Data Architect Data Engineer Data Scientist ML Engineer
BI Analyst 1.000000 0.755463 6.152017e-02 0.019224 1.250973e-02 0.130236 0.157206
Business Analyst 0.755463 1.000000 1.818269e-02 0.030046 2.396332e-02 0.190131 0.184929
Data Analyst 0.061520 0.018183 1.000000e+00 0.000239 5.860401e-08 0.003024 0.055678
Data Architect 0.019224 0.030046 2.385867e-04 1.000000 4.095596e-01 0.397706 0.739110
Data Engineer 0.012510 0.023963 5.860401e-08 0.409560 1.000000e+00 0.825087 0.473576
Data Scientist 0.130236 0.190131 3.024004e-03 0.397706 8.250873e-01 1.000000 0.441439
ML Engineer 0.157206 0.184929 5.567757e-02 0.739110 4.735764e-01 0.441439 1.000000

To easily visualize the results from the post hoc tests, they p-values matrices were converted to boolean values.

In [ ]:
alpha = 0.05
tukey_df < alpha
Out[ ]:
BI Analyst Business Analyst Data Analyst Data Architect Data Engineer Data Scientist ML Engineer
BI Analyst False False False True True False False
Business Analyst False False False False False False False
Data Analyst False False False True True True False
Data Architect True False True False False False False
Data Engineer True False True False False False False
Data Scientist False False True False False False False
ML Engineer False False False False False False False
In [ ]:
dunn_df < alpha
Out[ ]:
BI Analyst Business Analyst Data Analyst Data Architect Data Engineer Data Scientist ML Engineer
BI Analyst False False False True True False False
Business Analyst False False True True True False False
Data Analyst False True False True True True False
Data Architect True True True False False False False
Data Engineer True True True False False False False
Data Scientist False False True False False False False
ML Engineer False False False False False False False

Thus, according to the results above and significance level of $\alpha = 0.05$ :

  • The salary mean for BI Analyst positions is significantly different from those for Data Architect, and Data Engineer positions.
  • The salary mean for Business Analyst positions is significantly different from those for Data Analyst, Data Architect, Data Engineer, and Data Scientist positions.
  • The salary mean for Data Analyst positions is significantly different from those for Business Analyst, Data Architect, Data Engineer, and Data Scientist positions.
  • The salary mean for Data Architect positions is significantly different from those for BI Analyst, Business Analyst, and Data Analyst positions.
  • The salary mean for Data Engineer positions is significantly different from those for BI Analyst, Business Analyst, and Data Analyst positions.
  • The salary mean for Data Scientist positions is significantly different from those for Data Analyst positions.
  • Finally, as only one salary observation was retrieved for ML Engineer positions, no appropriate comparisons could be carried out by the post hoc tests.

In this context, specific data jobs categories were later compared using the one-sample t-test, the two-sample T-test with unequal variance, the Wilcoxon signed-rank test and the Mann-Whitney U test to further confirm the significance of the above results.

The one-sample t-test and the Wilcoxon signed-rank test were used when only one salary observation was collected for a data job category, i.e., ML Engineer positions; whereas the two-sample t-test with unequal variance and the Mann-Whitney U test were performed whenever the sample size of salaries for both data job categories under comparison was larger than 1, i.e., the rest of data job categories.

Thus, a function to compare the means of the different job categories was defined as follows:

In [ ]:
# Function to compare the mean salaries using a T-test and the Mann-Whitney U test
def compare_means(sample1, sample2, alternative, equal_var = False, alpha = 0.05):
    """
    This function performs both parametric and non-parametrics tests to compare means.
    
    When having two independent samples, the two-sample T-test and the Mann-Whitney U test are performed.
    
    On the other hand, when having only one sample, the one-sample T-test and the Wilcoxon signed-rank 
    test are performed.

    Parameters

    sample1: First sample array or single value.

    sample2: Second sample array or single value.

    alternative: Alternative hypothesis. If 'greater', the mean of the distribution 
    underlying the first sample is greater than the mean of the distribution underlying 
    the second sample. If 'less', the mean of the distribution underlying the first 
    sample is less than the mean of the distribution underlying the second sample. 
    Finally, if ‘two-sided’, the means of the distributions underlying the samples are unequal.
    
    equal_var: If True, the test performs a standard independent 2 sample test that 
    assumes equal population variances. If False (default), perform Welch’s t-test, 
    which does not assume equal population variance.
    
    alpha: Significance level. It is 0.05 by default.

    Returns

    stat_t: The calculated t-statistic.

    pvalue_t: The associated p-value for the chosen alternative from the t-test.

    stat_np: The non-parametric statistic corresponding with the first sample (either 
    the Mann-Whitney U or the Wilcoxon signed-rank T).

    pvalue_np: The associated p-value for the chosen alternative from the non-parametric test.

    """
    # Parametric and non parametric tests
    if len(sample1) == 1:        

        # One-sample T-Test
        stat_t, pvalue_t = stats.ttest_1samp(popmean = sample1, a = sample2, alternative = alternative)
        
        # Wilcoxon signed-rank test
        stat_np, pvalue_np = stats.wilcoxon((np.array(sample2) - sample1), alternative = alternative, method='auto')
                
        print(f'T-test: The t statistic is {stat_t:.03f}; and the p-value is {pvalue_t:.03f}.\n')
        print(f'Wilcoxon signed-rank test: The T statistic is {stat_np:.03f}; and the p-value is {pvalue_np:.03f}.\n')

    
    elif len(sample2) == 1:

        # One-sample T-Test
        stat_t, pvalue_t = stats.ttest_1samp(popmean = sample2, a = sample1, alternative = alternative)

        # Wilcoxon signed-rank test
        stat_np, pvalue_np = stats.wilcoxon((np.array(sample1) - sample2), alternative = alternative, method='auto')
        
        # Print of results
        print(f'T-test: The t statistic is {stat_t:.03f}; and the p-value is {pvalue_t:.03f}.\n')
        print(f'Wilcoxon signed-rank test: The T statistic is {stat_np:.03f}; and the p-value is {pvalue_np:.03f}.\n')
       
    else:

        # Two-sample T-Test
        stat_t, pvalue_t = stats.ttest_ind(a = sample1, b = sample2, equal_var = equal_var, alternative = alternative)

        # Mann-Whitney U test
        stat_np, pvalue_np = stats.mannwhitneyu(x = sample1, y = sample2, alternative = alternative)
        
        # Print of results
        print(f'T-test: The t statistic is {stat_t:.03f}; and the p-value is {pvalue_t:.03f}.\n')
        print(f'Mann-Whitney U test: The U statistic is {stat_np:.03f}; and the p-value is {pvalue_np:.03f}.\n')

    # Print of results
    
    
    # Interpretation
    alpha = 0.05
    if pvalue_t > alpha and pvalue_np > alpha:
      print('Test interpretation: Fail to reject H0.\n')
    elif pvalue_t < alpha and pvalue_np < alpha:
      print('Test interpretation: Reject H0.\n')
    else:
        print('Inconsistent results between the tests.\n')
    
    return stat_t, pvalue_t, stat_np, pvalue_np

Furthermore, the difference between the mean salaries, or absolute effect size, for each data job category was calculated in order to quantify the size of the effect as well as its percentage difference, calculated by dividing the absolute effect size by the average of the mean salaries and multiplying by 100.

Moreover, the confidence intervals for the mean were estimated by means of bootstrapping, as the salary observations do not follow a Gaussian distribution. Nonetheless, for sake of completeness, Cohen's d was also estimated.

Likewise, as the effect size was estimated several times for each data job salary comparison, several functions were defined as follows:

In [ ]:
# Function for estimating the Absolute Effect Size
def absolute_effect_size(sample1, sample2):
    """
    This functions estimates the effect size of the two mean samples by calculating the 
    absolute difference between them and its percentage difference.

    Parameters

    sample1: First sample array.

    sample2: Second sample array.

    Returns

    abs_effect_size: The absolute effect size of the two mean samples.

    percen_dif: The percentage difference of the two mean samples.

    """
    mean_sample1 = np.mean(sample1)

    mean_sample2 = np.mean(sample2)

    abs_effect_size = abs(mean_sample1 - mean_sample2) 

    percen_dif = (abs_effect_size / ((mean_sample1 + mean_sample2)/2)) * 100
    
    return abs_effect_size, percen_dif
In [ ]:
# Function for estimating the Standardized Effect Size (Cohen's d)
def cohens_d(sample1, sample2):
    """
    This functions estimates the standardized effect size of the two mean samples 
    by calculating the Cohen's d.

    Parameters

    sample1: First sample array.

    sample2: Second sample array.

    Returns

    d: The standardized effect size of the two mean samples (Cohen's d).

    """

    # Lenght of the samples
    n1, n2 = len(sample1), len(sample2)

    # Variances of the samples
    s1, s2 = np.var(sample1, ddof=1), np.var(sample2, ddof=1)

    # Pooled standard deviation
    s = np.sqrt(((n1 - 1) * s1 + (n2 - 1) * s2) / (n1 + n2 - 2))

    # Mean of the samples
    u1, u2 = np.mean(sample1), np.mean(sample2)

    # Cohen's d
    d = (u1 - u2) / s
    
    return d
In [ ]:
# Function for estimating the bootstrap confidence intervals
def boot_conf_int(sample, alpha = 0.05, iterations = 1000):
    """
    This functions estimates the bootstrap confidence intervals of a sample mean.

    Parameters

    sample: Sample array.
       
    alpha: Significance level. It is 0.05 by default.

    iterantions: Number of iterations. It is 1000 by default.

    Returns

    stats: List with the estimated statistics.

    lower_ci: Lower confidence interval.

    upper_ci: Upper confidence interval.

    
    """

    n_iterations = iterations
    n_size = int(len(sample) * 0.50)
    stats = []

    for i in range(n_iterations):
        new_sample = resample(sample, n_samples=n_size)
        mean_stat = np.mean(new_sample)
        stats.append(mean_stat)

    p = (alpha / 2) * 100
    lower_ci = np.percentile(stats, p)
    p = ((1 - alpha) + (alpha / 2)) * 100
    upper_ci = np.percentile(stats, p)      
 
    return stats, lower_ci, upper_ci
In [ ]:
# Function for estimating the Effect Size
def effect_size(sample1, sample2, legend1 = "Sample 1", legend2 = "Sample 2", alpha = 0.05):
    """
    This functions estimates the effect size of the two mean samples by calculating the 
    absolute effect size, the Cohen's d, and the bootstrap confidence intervals.

    Parameters

    sample1: First sample array.

    sample2: Second sample array.

    legend1: String with the legend for the sample 1.

    legend2: String with the legend for the sample 2.
    
    alpha: Significance level. It is 0.05 by default.

    Returns

    None

    """

    # Absolute Effect Size
    abs_effect_size, percen_dif = absolute_effect_size(sample1, sample2)
    print(f'The mean salary difference and the percentage difference between the two data job categories is: ${abs_effect_size:,.0f} and {percen_dif:,.2f}%, respectively.\n')    

    if len(sample1) > 1 and len(sample2) > 1:

        # Standardized Effect Size (Cohen's d)
        d = cohens_d(sample1, sample2)

        # Interpretation of Cohen's d
        print(f"The Cohen's d between the two data job categories is: {d:,.2f}.")
        if d <= 0.20:
          print(f"(Small Effect Size)\n")
        elif d <= 0.50:
          print(f"(Medium Effect Size)\n")
        else:
          print(f"(Large Effect Size)\n")
        
        # Legends list
        legends = [legend1, legend2]

        # Adding plot
        fig, ax = plt.subplots(figsize = (9, 5))

        for index, sample in enumerate([sample1, sample2]):
          
          # Bootstrap confidence intervals
          stats, lower_ci, upper_ci = boot_conf_int(sample, alpha = alpha, iterations = 1000)

          print(f"{legends[index]}'s Salary --> {(1 - alpha)*100}% Confidence Interval: (${lower_ci:,.0f}, ${upper_ci:,.0f})")

          
          ax.hist(stats, color = sns.color_palette('Blues_r')[3-index],
                  alpha = 0.65, label = f'{legends[index]}')  
        
        print('\n')
        plt.title(f'Bootstrap Distributions for {legend1} and {legend2} Salaries\n')
        plt.xlabel('Average Monthly Salary (MXN)')
        plt.ylabel('Frequency')    
        ax.xaxis.set_major_formatter(mpl.ticker.StrMethodFormatter('${x:,.0f}'))   
        plt.legend(loc = 'upper right', facecolor = 'white')
        plt.savefig(f'Figures/Fig_BootstrapDistributions_{legend1}-{legend2}.png',  bbox_inches = 'tight')
        plt.show()

6.6 Are the Data Architect salaries significantly lower than those for ML Engineers? ¶

To answer this question, the one-sample T-test and the Wilcoxon signed-rank test were performed based on the salary observations for that pair of data jobs categories. These tests were selected instead of the two-sample T-Test with unequal variance and the Mann-Whitney U test since only one salary observation was collected for ML engineer positions. Thus, said salary value served as the mean of the population, just for sake of comparison.

Furthermore, a effect size analysis was carried out to estimate the absolute salary difference and the percentage difference between the two data job categories.

$$\mathbf{H_0}: \mu_{DR} ≥ \mu_{ML}$$$$\mathbf{H_1}: \mu_{DR} < \mu_{ML}$$$$\alpha = 0.05$$

Moreover, as the one-sample T-test and the Wilcoxon signed-rank test were carried out to compare the mean salaries of the two data jobs categories:

In [ ]:
# Hypothesis testing
stat_t, pvalue_t, stat_T, pvalue_T = compare_means(DR, ML, alternative = 'less', alpha = 0.05)
T-test: The t statistic is -1.360; and the p-value is 0.116.

Wilcoxon signed-rank test: The T statistic is 6.000; and the p-value is 0.219.

Test interpretation: Fail to reject H0.

The obtained p-values from both tests are not significant ($p > 0.05$). Therefore, the Data Architect salaries are not significantly lower than that for ML Engineers.

However, as only one salary observation was retrieved for ML engineer positions, just the absolute effect size and the percentage difference was calculated for this case:

In [ ]:
# Estimating the Effect Size
effect_size(DR, ML)
The mean salary difference and the percentage difference between the two data job categories is: $10,375 and 17.35%, respectively.

So, from the present analysis, it is possible to conclude that the mean salary difference between ML Engineer and Data Architect positions is neither statistically nor practically significant.

Certainly, a difference of $10,375 MXN per month, or a percentage difference of 17.35%, is not that important when the salary observation for a ML Engineer is $65,000 MXN and the mean salary for **Data Architect** positions is about $54,625 MXN.

Indeed, according to the Reporte del Mercado Laboral de TI México 2023 (Spanish for Mexico IT Labor Market Report 2023) by Olvera (2023), only about 20% of the surveyed IT professionals in Mexico would be willing to a job change for a salary increase of 20% or less; whereas the 79% would be willing to a job change for a salary increase of 20% or more. So, as a heuristic rule, a percentage difference of 20% was selected as a cut-off value for practical significance.

6.7 Are the Data Engineer salaries significantly lower than those for ML Engineers? ¶

To answer this question, the one-sample T-test and the Wilcoxon signed-rank test were performed based on the salary observations for that pair of data jobs categories. Furthermore, a effect size analysis was carried out to estimate the absolute salary difference and the percentage difference between the two data job categories.

$$\mathbf{H_0}: \mu_{DE} ≥ \mu_{ML}$$$$\mathbf{H_1}: \mu_{DE} < \mu_{ML}$$$$\alpha = 0.05$$
In [ ]:
# Hypothesis testing
stat_t, pvalue_t, stat_T, pvalue_T = compare_means(DE, ML, alternative = 'less', alpha = 0.05)
T-test: The t statistic is -5.620; and the p-value is 0.000.

Wilcoxon signed-rank test: The T statistic is 18.500; and the p-value is 0.000.

Test interpretation: Reject H0.

d:\DataJobsMX-Nov2023\.venv\lib\site-packages\scipy\stats\_morestats.py:3414: UserWarning:

Exact p-value calculation does not work if there are zeros. Switching to normal approximation.

The obtained p-values from both tests are significant ($p < 0.05$). Therefore, the Data Engineer salaries are significantly lower than that for ML Engineers.

In [ ]:
# Estimating the Effect Size
effect_size(DE, ML)
The mean salary difference and the percentage difference between the two data job categories is: $19,612 and 35.53%, respectively.

So, from the effect size analysis, it is possible to conclude that the mean salary difference between ML Engineer and Data Engineer positions is actually practically significant. Indeed, a difference of $19,612 MXN per month or a percentage difference of 35.53% is important in the Mexican labor market.

Thus, even though the salary difference was found to be not statistically significant, practically it was. That is the reason why an effect size analysis should be carried out along with the hypothesis testing.

6.8 Are the Data Scientist salaries significantly lower than those for ML Engineers? ¶

To answer this question, the one-sample T-test and the Wilcoxon signed-rank test were performed based on the salary observations for that pair of data jobs categories. Furthermore, a effect size analysis was carried out to estimate the absolute salary difference and the percentage difference between the two data job categories.

$$\mathbf{H_0}: \mu_{DS} ≥ \mu_{ML}$$$$\mathbf{H_1}: \mu_{DS} < \mu_{ML}$$$$\alpha = 0.05$$
In [ ]:
# Hypothesis testing
stat_t, pvalue_t, stat_T, pvalue_T = compare_means(ML, DS, alternative = 'less', alpha = 0.05)
T-test: The t statistic is -2.311; and the p-value is 0.027.

Wilcoxon signed-rank test: The T statistic is 3.000; and the p-value is 0.020.

Test interpretation: Reject H0.

The obtained p-values from both tests are significant ($p < 0.05$). Therefore, the Data Scientist salaries are significantly lower than that for ML Engineers.

In [ ]:
# Estimating the Effect Size
effect_size(DS, ML)
The mean salary difference and the percentage difference between the two data job categories is: $20,531 and 37.51%, respectively.

So, from the effect size analysis, it is possible to conclude that the mean salary difference between ML Engineer and Data Scientist positions is actually practically significant. Indeed, a difference of $20,531 MXN per month or a percentage difference of 37.51% is important in the Mexican labor market.

6.9 Are the Business Analyst salaries significantly lower than those for ML Engineers? ¶

To answer this question, the one-sample T-test and the Wilcoxon signed-rank test were performed based on the salary observations for that pair of data jobs categories. Furthermore, a effect size analysis was carried out to estimate the absolute salary difference and the percentage difference between the two data job categories.

$$\mathbf{H_0}: \mu_{BA} ≥ \mu_{ML}$$$$\mathbf{H_1}: \mu_{BA} < \mu_{ML}$$$$\alpha = 0.05$$
In [ ]:
# Hypothesis testing
stat_t, pvalue_t, stat_T, pvalue_T = compare_means(ML, BA, alternative = 'less', alpha = 0.05)
T-test: The t statistic is -10.259; and the p-value is 0.000.

Wilcoxon signed-rank test: The T statistic is 1.500; and the p-value is 0.000.

Test interpretation: Reject H0.

The obtained p-values from both tests are significant ($p > 0.05$). Therefore, the Business Analyst salaries are significantly lower than that for ML Engineers.

In [ ]:
# Estimating the Effect Size
effect_size(BA, ML)
The mean salary difference and the percentage difference between the two data job categories is: $33,107 and 68.34%, respectively.

So, from the effect size analysis, it is possible to conclude that the mean salary difference between ML Engineer and Business Analyst positions is actually practically significant. Indeed, a difference of $33,107 MXN per month or a percentage difference of 68.34% is remarkable in the Mexican labor market.

6.10 Are the BI Analyst salaries significantly lower than those for ML Engineers? ¶

To answer this question, the one-sample T-test and the Wilcoxon signed-rank test were performed based on the salary observations for that pair of data jobs categories. Furthermore, a effect size analysis was carried out to estimate the absolute salary difference and the percentage difference between the two data job categories.

$$\mathbf{H_0}: \mu_{BI} ≥ \mu_{ML}$$$$\mathbf{H_1}: \mu_{BI} < \mu_{ML}$$$$\alpha = 0.05$$
In [ ]:
# Hypothesis testing
stat_t, pvalue_t, stat_T, pvalue_T = compare_means(ML, BI, alternative = 'less', alpha = 0.05)
T-test: The t statistic is -11.271; and the p-value is 0.000.

Wilcoxon signed-rank test: The T statistic is 1.000; and the p-value is 0.000.

Test interpretation: Reject H0.

The obtained p-values from both tests are significant ($p < 0.05$). Therefore, the BI Analyst salaries are significantly lower than that for ML Engineers.

In [ ]:
# Estimating the Effect Size
effect_size(ML, BI)
The mean salary difference and the percentage difference between the two data job categories is: $35,053 and 73.84%, respectively.

Thus, from the effect size analysis, it is possible to conclude that the mean salary difference between ML Engineer and BI Analyst positions is actually practically significant. Indeed, a difference of $35,053 MXN per month or a percentage difference of 73.84% is remarkable in the Mexican labor market.

6.11 Are the Data Analyst salaries significantly lower than those for ML Engineers? ¶

To answer this question, the one-sample T-test and the Wilcoxon signed-rank test were performed based on the salary observations for that pair of data jobs categories. Furthermore, a effect size analysis was carried out to estimate the absolute salary difference and the percentage difference between the two data job categories.

$$\mathbf{H_0}: \mu_{DA} ≥ \mu_{ML}$$$$\mathbf{H_1}: \mu_{DA} < \mu_{ML}$$$$\alpha = 0.05$$
In [ ]:
# Hypothesis testing
stat_t, pvalue_t, stat_T, pvalue_T = compare_means(DA, ML, alternative = 'less', alpha = 0.05)
T-test: The t statistic is -21.190; and the p-value is 0.000.

Wilcoxon signed-rank test: The T statistic is 4.000; and the p-value is 0.000.

Test interpretation: Reject H0.

The obtained p-values from both tests are significant ($p < 0.05$). Therefore, the Data Analyst salaries are significantly lower than that for ML Engineers.

In [ ]:
# Estimating the Effect Size
effect_size(DA, ML)
The mean salary difference and the percentage difference between the two data job categories is: $40,693 and 91.13%, respectively.

Thus, from the effect size analysis, it is possible to conclude that the mean salary difference between ML Engineer and BI Analyst positions is actually practically significant. Indeed, a difference of $40,693 MXN per month or a percentage difference of 91.13% is very important in the Mexican labor market.

6.12 Are the Data Architect salaries significantly higher than those for Data Engineers? ¶

To answer this question, the two-sample T-test with unequal variance and the Mann-Whitney U test were performed based on the salary observations. Furthermore, a effect size analysis was carried out to estimate the absolute salary difference and the percentage difference between the two data job categories, the corresponding Cohen's d; as well as their bootstrap confidence intervals.

$$\mathbf{H_0}: \mu_{DR} ≤ \mu_{DE}$$$$\mathbf{H_1}: \mu_{DR} > \mu_{DE}$$$$\alpha = 0.05$$
In [ ]:
# Hypothesis testing
stat_t, pvalue_t, stat_u, pvalue_u = compare_means(DR, DE, alternative = 'greater', alpha = 0.05)
T-test: The t statistic is 1.101; and the p-value is 0.153.

Mann-Whitney U test: The U statistic is 119.000; and the p-value is 0.113.

Test interpretation: Fail to reject H0.

The obtained p-values from both tests are not significant ($p > 0.05$). Therefore, the Data Architect salaries are not significantly higher than those for Data Engineers.

In [ ]:
# Estimating the Effect Size
effect_size(DR, DE, "Data Architect", "Data Engineer", alpha = 0.05)
The mean salary difference and the percentage difference between the two data job categories is: $9,238 and 18.47%, respectively.

The Cohen's d between the two data job categories is: 0.48.
(Medium Effect Size)

Data Architect's Salary --> 95.0% Confidence Interval: ($33,750, $72,083)
Data Engineer's Salary --> 95.0% Confidence Interval: ($36,472, $54,384)


2024-01-15T22:13:24.111173 image/svg+xml Matplotlib v3.7.4, https://matplotlib.org/

So, from the effect size analysis, it is possible to conclude that the mean salary difference between Data Architect and Data Engineer positions is neither statistically nor practically significant.

Certainly, a difference of $9,238 MXN per month and a percentage difference of about 18.47% is not that huge in the Mexican labor market. And, in this sense, it is noteworthy that the bootstrap confidence intervals for both positions are similar.

6.13 Are the Data Architect salaries significantly higher than those for Data Scientists? ¶

To answer this question, the two-sample T-test with unequal variance and the Mann-Whitney U test were performed based on the salary observations. Furthermore, a effect size analysis was carried out to estimate the absolute salary difference and the percentage difference between the two data job categories, the corresponding Cohen's d; as well as their bootstrap confidence intervals.

$$\mathbf{H_0}: \mu_{DR} ≤ \mu_{DS}$$$$\mathbf{H_1}: \mu_{DR} > \mu_{DS}$$$$\alpha = 0.05$$
In [ ]:
# Hypothesis testing
stat_t, pvalue_t, stat_u, pvalue_u = compare_means(DR, DS, alternative = 'greater', alpha = 0.05)
T-test: The t statistic is 0.867; and the p-value is 0.201.

Mann-Whitney U test: The U statistic is 31.500; and the p-value is 0.183.

Test interpretation: Fail to reject H0.

The obtained p-values from both tests are not significant ($p > 0.05$). Therefore, the Data Architect salaries are not significantly higher than those for Data Scientists.

In [ ]:
# Estimating the Effect Size
effect_size(DR, DS, "Data Architect", "Data Scientist", alpha = 0.05)
The mean salary difference and the percentage difference between the two data job categories is: $10,156 and 20.50%, respectively.

The Cohen's d between the two data job categories is: 0.45.
(Medium Effect Size)

Data Architect's Salary --> 95.0% Confidence Interval: ($36,833, $72,083)
Data Scientist's Salary --> 95.0% Confidence Interval: ($23,875, $71,875)


2024-01-15T22:13:25.690175 image/svg+xml Matplotlib v3.7.4, https://matplotlib.org/

So, from the effect size analysis, it is possible to conclude that the mean salary difference between Data Architect and Data Scientist positions is neither statistically nor practically significant.

Certainly, a difference of $10,156 MXN per month and a percentage difference of about 20.50% is not that huge in the Mexican labor market for this kind of positions, taking into account that the mean salaries for Data Architect and Data Scientist positions are $54,625 MXN and $44,468 MXN, respectively. And, in this sense, it is noteworthy that the bootstrap confidence intervals for both positions are similar.

6.14 Are the Data Architect salaries significantly higher than those for Business Analysts? ¶

To answer this question, the two-sample T-test with unequal variance and the Mann-Whitney U test were performed based on the salary observations. Furthermore, a effect size analysis was carried out to estimate the absolute salary difference and the percentage difference between the two data job categories, the corresponding Cohen's d; as well as their bootstrap confidence intervals.

$$\mathbf{H_0}: \mu_{DR} ≤ \mu_{BA}$$$$\mathbf{H_1}: \mu_{DR} > \mu_{BA}$$$$\alpha = 0.05$$
In [ ]:
# Hypothesis testing
stat_t, pvalue_t, stat_u, pvalue_u = compare_means(DR, BA, alternative = 'greater', alpha = 0.05)
T-test: The t statistic is 2.744; and the p-value is 0.015.

Mann-Whitney U test: The U statistic is 115.500; and the p-value is 0.007.

Test interpretation: Reject H0.

The obtained p-values from both tests are significant ($p < 0.05$). Therefore, the Data Architect salaries are significantly higher than those for Business Analysts.

In [ ]:
# Estimating the Effect Size
effect_size(DR, BA, "Data Architect", "Business Analyst", alpha = 0.05)
The mean salary difference and the percentage difference between the two data job categories is: $22,732 and 52.55%, respectively.

The Cohen's d between the two data job categories is: 1.41.
(Large Effect Size)

Data Architect's Salary --> 95.0% Confidence Interval: ($30,917, $70,000)
Business Analyst's Salary --> 95.0% Confidence Interval: ($22,776, $41,412)


2024-01-15T22:13:27.619175 image/svg+xml Matplotlib v3.7.4, https://matplotlib.org/

So, from the effect size analysis, it is possible to conclude that the mean salary difference between Data Architect and Business Analyst positions is both statistically and practically significant.

Indeed, a difference of $22,732 MXN per month and a percentage difference of about 52.55% is important in the Mexican labor market. And, in this sense, it is noteworthy that the bootstrap confidence intervals for each position are very different.

6.15 Are the Data Architect salaries significantly higher than those for BI Analysts? ¶

To answer this question, the two-sample T-test with unequal variance and the Mann-Whitney U test were performed based on the salary observations. Furthermore, a effect size analysis was carried out to estimate the absolute salary difference and the percentage difference between the two data job categories, the corresponding Cohen's d; as well as their bootstrap confidence intervals.

$$\mathbf{H_0}: \mu_{DR} ≤ \mu_{BI}$$$$\mathbf{H_1}: \mu_{DR} > \mu_{BI}$$$$\alpha = 0.05$$
In [ ]:
# Hypothesis testing
stat_t, pvalue_t, stat_u, pvalue_u = compare_means(DR, BI, alternative = 'greater', alpha = 0.05)
T-test: The t statistic is 2.995; and the p-value is 0.010.

Mann-Whitney U test: The U statistic is 101.000; and the p-value is 0.007.

Test interpretation: Reject H0.

The obtained p-values from both tests are significant ($p < 0.05$). Therefore, the Data Architect salaries are significantly higher than those for BI Analysts.

In [ ]:
# Estimating the Effect Size
effect_size(DR, BI, "Data Architect", "BI Analyst", alpha = 0.05)
The mean salary difference and the percentage difference between the two data job categories is: $24,678 and 58.36%, respectively.

The Cohen's d between the two data job categories is: 1.64.
(Large Effect Size)

Data Architect's Salary --> 95.0% Confidence Interval: ($33,750, $72,500)
BI Analyst's Salary --> 95.0% Confidence Interval: ($22,749, $39,650)


2024-01-15T22:13:29.863175 image/svg+xml Matplotlib v3.7.4, https://matplotlib.org/

So, from the effect size analysis, it is possible to conclude that the mean salary difference between Data Architect and BI Analyst positions is both statistically and practically significant.

Indeed, a difference of $24,678 MXN per month and a percentage difference of about 58.36% is important in the Mexican labor market. And, in this sense, it is noteworthy that the bootstrap confidence intervals for each position are very different.

6.16 Are the Data Architect salaries significantly higher than those for Data Analysts? ¶

To answer this question, the two-sample T-test with unequal variance and the Mann-Whitney U test were performed based on the salary observations. Furthermore, a effect size analysis was carried out to estimate the absolute salary difference and the percentage difference between the two data job categories, the corresponding Cohen's d; as well as their bootstrap confidence intervals.

$$\mathbf{H_0}: \mu_{DR} ≤ \mu_{DA}$$$$\mathbf{H_1}: \mu_{DR} > \mu_{DA}$$$$\alpha = 0.05$$
In [ ]:
# Hypothesis testing
stat_t, pvalue_t, stat_u, pvalue_u = compare_means(DR, DA, alternative = 'greater', alpha = 0.05)
T-test: The t statistic is 3.854; and the p-value is 0.005.

Mann-Whitney U test: The U statistic is 345.000; and the p-value is 0.001.

Test interpretation: Reject H0.

The obtained p-values from both tests are significant ($p < 0.05$). Therefore, the Data Architect salaries are significantly higher than those for Data Analysts.

In [ ]:
# Estimating the Effect Size
effect_size(DR, DA, "Data Architect", "DA Analyst", alpha = 0.05)
The mean salary difference and the percentage difference between the two data job categories is: $30,318 and 76.82%, respectively.

The Cohen's d between the two data job categories is: 1.94.
(Large Effect Size)

Data Architect's Salary --> 95.0% Confidence Interval: ($33,750, $72,083)
DA Analyst's Salary --> 95.0% Confidence Interval: ($19,596, $29,860)


2024-01-15T22:13:31.932176 image/svg+xml Matplotlib v3.7.4, https://matplotlib.org/

So, from the effect size analysis, it is possible to conclude that the mean salary difference between Data Architect and Data Analyst positions is both statistically and practically significant.

Indeed, a difference of $30,318 MXN per month and a percentage difference of about 76.82% is important in the Mexican labor market. And, in this sense, it is noteworthy that the bootstrap confidence intervals for each position are also very different between each other.

6.17 Are the Data Engineer salaries significantly higher than those for Data Scientists? ¶

To answer this question, the two-sample T-test with unequal variance and the Mann-Whitney U test were performed based on the salary observations. Furthermore, a effect size analysis was carried out to estimate the absolute salary difference and the percentage difference between the two data job categories, the corresponding Cohen's d; as well as their bootstrap confidence intervals.

$$\mathbf{H_0}: \mu_{DE} ≤ \mu_{DS}$$$$\mathbf{H_1}: \mu_{DE} > \mu_{DS}$$$$\alpha = 0.05$$
In [ ]:
# Hypothesis testing
stat_t, pvalue_t, stat_u, pvalue_u = compare_means(DE, DS, alternative = 'greater', alpha = 0.05)
T-test: The t statistic is 0.096; and the p-value is 0.463.

Mann-Whitney U test: The U statistic is 127.500; and the p-value is 0.401.

Test interpretation: Fail to reject H0.

The obtained p-values from the T-test and the Mann-Whitney U test were not significant ($p > 0.05$). Therefore, the Data Engineer salaries are not significantly higher than those for Data Scientists.

In [ ]:
# Effect size analysis
effect_size(DE, DS, "Data Engineer", "Data Scientist", alpha = 0.05)
The mean salary difference and the percentage difference between the two data job categories is: $919 and 2.04%, respectively.

The Cohen's d between the two data job categories is: 0.04.
(Small Effect Size)

Data Engineer's Salary --> 95.0% Confidence Interval: ($35,539, $54,317)
Data Scientist's Salary --> 95.0% Confidence Interval: ($24,562, $71,897)


2024-01-15T22:13:33.893172 image/svg+xml Matplotlib v3.7.4, https://matplotlib.org/

From the effect size analysis, it is possible to conclude that the mean salary difference between Data Engineer and Data Scientist positions is also practically not significant.

Certainly, a difference of $919 MXN per month and a percentange difference of about 2.04% is neglectable in the Mexican labor market for technology. And, in this sense, it is noteworthy that the bootstrap confidence intervals for both positions overlap in a large extent as it is clearly shown in the plot above.

6.17 Are the Data Engineer salaries significantly higher than those for Business Analysts? ¶

To answer this question, the two-sample T-test with unequal variance and the Mann-Whitney U test were performed based on the salary observations. Furthermore, a effect size analysis was carried out to estimate the absolute salary difference and the percentage difference between the two data job categories, the corresponding Cohen's d; as well as their bootstrap confidence intervals.

$$\mathbf{H_0}: \mu_{DE} ≤ \mu_{BA}$$$$\mathbf{H_1}: \mu_{DE} > \mu_{BA}$$$$\alpha = 0.05$$
In [ ]:
# Hypothesis testing
stat_t, pvalue_t, stat_u, pvalue_u = compare_means(DE, BA, alternative = 'greater', alpha = 0.05)
T-test: The t statistic is 2.839; and the p-value is 0.003.

Mann-Whitney U test: The U statistic is 487.500; and the p-value is 0.005.

Test interpretation: Reject H0.

The obtained p-values from both tests are significant ($p < 0.05$). Therefore, the Data Engineer salaries are significantly higher than those for Business Analysts.

In [ ]:
# Effect size analysis
effect_size(DE, BA, "Data Engineer", "Business Analysts", alpha = 0.05)
The mean salary difference and the percentage difference between the two data job categories is: $13,494 and 34.92%, respectively.

The Cohen's d between the two data job categories is: 0.77.
(Large Effect Size)

Data Engineer's Salary --> 95.0% Confidence Interval: ($36,665, $55,675)
Business Analysts's Salary --> 95.0% Confidence Interval: ($23,784, $41,184)


2024-01-15T22:13:35.926174 image/svg+xml Matplotlib v3.7.4, https://matplotlib.org/

From the effect size analysis, it is possible to conclude that the mean salary difference between Data Engineer and Business Analyst positions is practically significant. Indeed, a difference of about $13,494 MXN per month and a percentage difference of about 34.92% is relevant in the Mexican labor market.

6.18 Are the Data Engineer salaries significantly higher than those for BI Analysts? ¶

To answer this question, the two-sample T-test with unequal variance and the Mann-Whitney U test were performed based on the salary observations. Furthermore, a effect size analysis was carried out to estimate the absolute salary difference and the percentage difference between the two data job categories, the corresponding Cohen's d; as well as their bootstrap confidence intervals.

$$\mathbf{H_0}: \mu_{DE} ≤ \mu_{BI}$$$$\mathbf{H_1}: \mu_{DE} > \mu_{BI}$$$$\alpha = 0.05$$
In [ ]:
# Hypothesis testing
stat_t, pvalue_t, stat_u, pvalue_u = compare_means(DE, BI, alternative = 'greater', alpha = 0.05)
T-test: The t statistic is 3.303; and the p-value is 0.001.

Mann-Whitney U test: The U statistic is 453.500; and the p-value is 0.001.

Test interpretation: Reject H0.

The obtained p-values from both tests are significant ($p < 0.05$). Therefore, the Data Engineer salaries are significantly higher than those for BI Analysts.

In [ ]:
# Effect size analysis
effect_size(DE, BI, "Data Engineer", "BI Analyst", alpha = 0.05)
The mean salary difference and the percentage difference between the two data job categories is: $15,441 and 40.99%, respectively.

The Cohen's d between the two data job categories is: 0.90.
(Large Effect Size)

Data Engineer's Salary --> 95.0% Confidence Interval: ($36,350, $55,168)
BI Analyst's Salary --> 95.0% Confidence Interval: ($22,488, $39,125)


2024-01-15T22:13:37.882175 image/svg+xml Matplotlib v3.7.4, https://matplotlib.org/

From the effect size analysis, it is possible to conclude that the mean salary difference between Data Engineer and BI Analyst positions is, not only statistically significant, but also practically significant.

Certainly, a difference of $15,441 MXN per month and a percentage difference of about 40.99% is important in the Mexican labor market. And, in this sense, it is noteworthy that the bootstrap confidence intervals for both positions are different from each other as shown in the plot above.

6.19 Are the Data Engineer salaries significantly higher than those for Data Analysts? ¶

To answer this question, the two-sample T-test with unequal variance and the Mann-Whitney U test were performed based on the salary observations. Furthermore, a effect size analysis was carried out to estimate the absolute salary difference and the percentage difference between the two data job categories, the corresponding Cohen's d; as well as their bootstrap confidence intervals.

$$\mathbf{H_0}: \mu_{DE} ≤ \mu_{DA}$$$$\mathbf{H_1}: \mu_{DE} > \mu_{DA}$$$$\alpha = 0.05$$
In [ ]:
# Hypothesis testing
stat_t, pvalue_t, stat_u, pvalue_u = compare_means(DE, DA, alternative = 'greater', alpha = 0.05)
T-test: The t statistic is 5.293; and the p-value is 0.000.

Mann-Whitney U test: The U statistic is 1593.500; and the p-value is 0.000.

Test interpretation: Reject H0.

The obtained p-values from both tests are significant ($p < 0.05$). Therefore, the Data Engineer salaries are significantly higher than those for Data Analysts.

In [ ]:
# Effect size analysis
effect_size(DE, DA, "Data Engineer", "Data Analyst", alpha = 0.05)
The mean salary difference and the percentage difference between the two data job categories is: $21,081 and 60.50%, respectively.

The Cohen's d between the two data job categories is: 1.27.
(Large Effect Size)

Data Engineer's Salary --> 95.0% Confidence Interval: ($35,898, $54,400)
Data Analyst's Salary --> 95.0% Confidence Interval: ($19,518, $30,084)


2024-01-15T22:13:39.903174 image/svg+xml Matplotlib v3.7.4, https://matplotlib.org/

From the effect size analysis, it is possible to conclude that the mean salary difference between Data Engineer and Data Analyst positions is, not only statistically significant, but also practically significant.

Certainly, a difference of $21,081 MXN per month and a percentage difference of about 60.50% is large in the Mexican labor market.

6.20 Are the Data Scientist salaries significantly higher than those for Business Analysts? ¶

To answer this question, the two-sample T-test with unequal variance and the Mann-Whitney U test were performed based on the salary observations. Furthermore, a effect size analysis was carried out to estimate the absolute salary difference and the percentage difference between the two data job categories, the corresponding Cohen's d; as well as their bootstrap confidence intervals.

$$\mathbf{H_0}: \mu_{DS} ≤ \mu_{BA}$$$$\mathbf{H_1}: \mu_{DS} > \mu_{BA}$$$$\alpha = 0.05$$
In [ ]:
# Hypothesis testing
stat_t, pvalue_t, stat_u, pvalue_u = compare_means(DS, BA, alternative = 'greater', alpha = 0.05)
T-test: The t statistic is 1.331; and the p-value is 0.108.

Mann-Whitney U test: The U statistic is 122.500; and the p-value is 0.088.

Test interpretation: Fail to reject H0.

The obtained p-values from the T-test and the Mann-Whitney U test were not significant ($p > 0.05$). Therefore, the Data Scientist salaries are not significantly higher than those for Business Analysts.

In [ ]:
# Effect size analysis
effect_size(DS, BA, "Data Scientist", "Business Analyst", alpha = 0.05)
The mean salary difference and the percentage difference between the two data job categories is: $12,575 and 32.94%, respectively.

The Cohen's d between the two data job categories is: 0.69.
(Large Effect Size)

Data Scientist's Salary --> 95.0% Confidence Interval: ($24,812, $70,894)
Business Analyst's Salary --> 95.0% Confidence Interval: ($22,657, $41,298)


2024-01-15T22:13:42.092176 image/svg+xml Matplotlib v3.7.4, https://matplotlib.org/

In this case, even though the mean salary difference between Data Scientist and Business Analyst positions is not statistically significant, a difference of $12,575 MXN per month and a percentage difference of about 32.94% is not neglectable.

Therefore, notwithstanding with their lack of statistical significance, based on the effect size analysis, it is possible to state that there is a significant practical difference between the mean salary difference of the Data Scientist and the Business Analyst positions. That is the reason why an effect size analysis should be carried out along with the hypothesis testing.

6.21 Are the Data Scientist salaries significantly higher than those for BI Analysts? ¶

To answer this question, the two-sample T-test with unequal variance and the Mann-Whitney U test were performed based on the salary observations. Furthermore, a effect size analysis was carried out to estimate the absolute salary difference and the percentage difference between the two data job categories, the corresponding Cohen's d; as well as their bootstrap confidence intervals.

$$\mathbf{H_0}: \mu_{DS} ≤ \mu_{BI}$$$$\mathbf{H_1}: \mu_{DS} > \mu_{BI}$$$$\alpha = 0.05$$
In [ ]:
# Hypothesis testing
stat_t, pvalue_t, stat_u, pvalue_u = compare_means(DS, BI, alternative = 'greater', alpha = 0.05)
T-test: The t statistic is 1.543; and the p-value is 0.079.

Mann-Whitney U test: The U statistic is 119.500; and the p-value is 0.024.

Inconsistent results between the tests.

The obtained p-value from the T-test was not significant ($p > 0.05$); whereas the obtained p-value from the Mann-Whitney U test was significant ($p < 0.05$).

In this context, taking into account that the Data Scientist and BI Analysts salary observations do not follow a normal distribution, the results from the Mann-Whitney U test could be deemed as more accurate.

Therefore, all in all, the Data Scientist salaries are significantly higher than those for BI Analysts.

In [ ]:
# Effect size analysis
effect_size(DS, BI, "Data Scientist", "BI Analyst", alpha = 0.05)
The mean salary difference and the percentage difference between the two data job categories is: $14,522 and 39.03%, respectively.

The Cohen's d between the two data job categories is: 0.82.
(Large Effect Size)

Data Scientist's Salary --> 95.0% Confidence Interval: ($25,117, $69,961)
BI Analyst's Salary --> 95.0% Confidence Interval: ($22,643, $38,651)


2024-01-15T22:13:43.727173 image/svg+xml Matplotlib v3.7.4, https://matplotlib.org/

From the effect size analysis, it is possible to conclude that the mean salary difference between Data Scientist and BI Analyst positions is, not only statistically significant, but also practically significant.

Certainly, a difference of $14,522 MXN per month and a percentage difference of about 39.03% is non-neglectable in the Mexican labor market.

6.22 Are the Data Scientist salaries significantly higher than those for Data Analysts? ¶

To answer this question, the two-sample T-test with unequal variance and the Mann-Whitney U test were performed based on the salary observations. Furthermore, a effect size analysis was carried out to estimate the absolute salary difference and the percentage difference between the two data job categories, the corresponding Cohen's d; as well as their bootstrap confidence intervals.

$$\mathbf{H_0}: \mu_{DS} ≤ \mu_{DA}$$$$\mathbf{H_1}: \mu_{DS} > \mu_{DA}$$$$\alpha = 0.05$$
In [ ]:
# Hypothesis testing
stat_t, pvalue_t, stat_u, pvalue_u = compare_means(DS, DA, alternative = 'greater', alpha = 0.05)
T-test: The t statistic is 2.219; and the p-value is 0.029.

Mann-Whitney U test: The U statistic is 411.500; and the p-value is 0.003.

Test interpretation: Reject H0.

The obtained p-values from both tests are significant ($p < 0.05$). Therefore, the Data Scientist salaries are significantly higher than those for Data Analysts.

In [ ]:
# Effect size analysis
effect_size(DS, DA, "Data Scientist", "Data Analyst", alpha = 0.05)
The mean salary difference and the percentage difference between the two data job categories is: $20,162 and 58.63%, respectively.

The Cohen's d between the two data job categories is: 1.21.
(Large Effect Size)

Data Scientist's Salary --> 95.0% Confidence Interval: ($26,053, $69,375)
Data Analyst's Salary --> 95.0% Confidence Interval: ($19,430, $29,902)


2024-01-15T22:13:45.327172 image/svg+xml Matplotlib v3.7.4, https://matplotlib.org/

From the effect size analysis, it is possible to conclude that the mean salary difference between Data Scientist and Data Analyst positions is both statistically and practically significant.

Certainly, a difference of $20,162 MXN per month and a percentage difference of about 58.63% is quite large in the Mexican labor market.

6.23 Are the Business Analyst salaries significantly higher than those for BI Analysts? ¶

To answer this question, the two-sample T-test with unequal variance and the Mann-Whitney U test were performed based on the salary observations. Furthermore, a effect size analysis was carried out to estimate the absolute salary difference and the percentage difference between the two data job categories, the corresponding Cohen's d; as well as their bootstrap confidence intervals.

$$\mathbf{H_0}: \mu_{BA} ≤ \mu_{BI}$$$$\mathbf{H_1}: \mu_{BA} > \mu_{BI}$$$$\alpha = 0.05$$
In [ ]:
# Hypothesis testing
stat_t, pvalue_t, stat_u, pvalue_u = compare_means(BA, BI, alternative = 'greater', alpha = 0.05)
T-test: The t statistic is 0.434; and the p-value is 0.333.

Mann-Whitney U test: The U statistic is 252.000; and the p-value is 0.300.

Test interpretation: Fail to reject H0.

The obtained p-values from both tests are not significant ($p > 0.05$). Therefore, the Business Analyst salaries are not significantly higher than those for BI Analysts.

In [ ]:
# Effect size analysis
effect_size(BA, BI, "Business Analyst", "BI Analys", alpha = 0.05)
The mean salary difference and the percentage difference between the two data job categories is: $1,947 and 6.30%, respectively.

The Cohen's d between the two data job categories is: 0.13.
(Small Effect Size)

Business Analyst's Salary --> 95.0% Confidence Interval: ($23,594, $41,506)
BI Analys's Salary --> 95.0% Confidence Interval: ($22,560, $39,100)


2024-01-15T22:13:46.839173 image/svg+xml Matplotlib v3.7.4, https://matplotlib.org/

From the effect size analysis, it is possible to conclude that the mean salary difference between Business Analyst and BI Analyst positions is also not practically significant.

Certainly, a difference of $1,947 MXN per month and a percentange difference of about 6.30% is neglectable in the Mexican labor market.

6.24 Are the Business Analyst salaries significantly higher than those for Data Analysts? ¶

To answer this question, the two-sample T-test with unequal variance and the Mann-Whitney U test were performed based on the salary observations. Furthermore, a effect size analysis was carried out to estimate the absolute salary difference and the percentage difference between the two data job categories, the corresponding Cohen's d; as well as their bootstrap confidence intervals.

$$\mathbf{H_0}: \mu_{BA} ≤ \mu_{DA}$$$$\mathbf{H_1}: \mu_{BA} > \mu_{DA}$$$$\alpha = 0.05$$
In [ ]:
# Hypothesis testing
stat_t, pvalue_t, stat_u, pvalue_u = compare_means(BA, DA, alternative = 'greater', alpha = 0.05)
T-test: The t statistic is 2.020; and the p-value is 0.025.

Mann-Whitney U test: The U statistic is 999.000; and the p-value is 0.006.

Test interpretation: Reject H0.

The obtained p-values from both tests are significant ($p < 0.05$). Therefore, the Business Analyst salaries are significantly higher than those for Data Analysts.

In [ ]:
# Effect size analysis
effect_size(BA, DA, "Business Analyst", "Data Analyst", alpha = 0.05)
The mean salary difference and the percentage difference between the two data job categories is: $7,587 and 27.00%, respectively.

The Cohen's d between the two data job categories is: 0.49.
(Medium Effect Size)

Business Analyst's Salary --> 95.0% Confidence Interval: ($22,987, $40,867)
Data Analyst's Salary --> 95.0% Confidence Interval: ($19,582, $29,529)


2024-01-15T22:13:48.800174 image/svg+xml Matplotlib v3.7.4, https://matplotlib.org/

From the effect size analysis, it is possible to conclude that the salary difference between Business Analyst and Data Analyst positions is, not only statistically significant, but also practically significant.

Certainly, a difference of $7,587 MXN per month and a percentange difference of about 27.00% is non-neglectable in the Mexican labor market.

6.25 Are the BI Analyst salaries significantly higher than those for Data Analysts? ¶

To answer this question, the two-sample T-test with unequal variance and the Mann-Whitney U test were performed based on the salary observations. Furthermore, a effect size analysis was carried out to estimate the absolute salary difference and the percentage difference between the two data job categories, the corresponding Cohen's d; as well as their bootstrap confidence intervals.

$$\mathbf{H_0}: \mu_{BI} ≤ \mu_{DA}$$$$\mathbf{H_1}: \mu_{BI} > \mu_{DA}$$$$\alpha = 0.05$$
In [ ]:
# Hypothesis testing
stat_t, pvalue_t, stat_u, pvalue_u = compare_means(BI, DA, alternative = 'greater', alpha = 0.05)
T-test: The t statistic is 1.543; and the p-value is 0.066.

Mann-Whitney U test: The U statistic is 869.000; and the p-value is 0.008.

Inconsistent results between the tests.

The obtained p-value from the T-test was not significant ($p > 0.05$); whereas the obtained p-value from the Mann-Whitney U test was significant ($p < 0.05$).

In this context, taking into account that the BI Analyst and Data Analysts salary observations do not follow a normal distribution, the results from the Mann-Whitney U test could be deemed as more accurate.

Therefore, all in all, the BI Analyst salaries are significantly higher than those for Data Analysts.

In [ ]:
# Effect size analysis
effect_size(BI, DA, "BI Analyst", "Data Analyst", alpha = 0.05)
The mean salary difference and the percentage difference between the two data job categories is: $5,640 and 20.79%, respectively.

The Cohen's d between the two data job categories is: 0.38.
(Medium Effect Size)

BI Analyst's Salary --> 95.0% Confidence Interval: ($22,463, $38,381)
Data Analyst's Salary --> 95.0% Confidence Interval: ($19,475, $29,708)


2024-01-15T22:13:50.315692 image/svg+xml Matplotlib v3.7.4, https://matplotlib.org/

From the effect size analysis, it is possible to conclude that, even though the salary difference between BI Analyst and Data Analyst positions is statistically significant, it is not practically significant, as a difference of $5,640 MXN per month and a percentage difference of about 20.79% is not very important in the Mexican labor market for technology.

Again, here it lies the importance of peforming a effect size analysis along with the hypothesis testing.


7. Conclusions ¶


The data job category with the highest salaries in the Mexican labor market in November 2023 according to the OCC website were Data Architect and ML Engineer. Indeed, the average salary for Data Architect positions was not signicantly lower than for ML Engineers. Thus, the present study's hypothesis is rejected. However, this result must be taken with caution as only one salary observation could be retrieved for the latter position.

On the other hand, the data job category most demanded in the Mexican labor market was Data Analyst; even though it was also the one with the lowest salary. Also, this data job category is the most demanded across the different Mexican states, whereas ML Engineer and Data Architect positions were the most concentrated in certain locations, namely, Ciudad de México, Nuevo León, and Jalisco.

Moreover, Ciudad de México was the location where it is possible to find the highest jobs demand. However, Estado de México and Jalisco were locations along with remote positions where the demand of data jobs is the highest after the capital city. Similarly, in those locations the highest salaries can be found.

Furthermore, the companies with the greatest demand of data positions were Bairesdev, Banamex, Pepsico and Softek; nevertheless, the organizations that offered the highest salaries were Ecosistemex, Caspex Corp, Addon Technologies, Enterprise Solutions, and Softek, which correspond to recruiting agencies and tech consulting firms.

Additionally, it was also found that the data jobs demand from some companies spread across several Mexican states such as Bairesdev, Pepsico or Softek, and that there are some well-known organizations whose data jobs demand is not located in the capital region, such as Johnson Controls (Nuevo León).

The results of the present study suggest that Data Analyst, Business Analyst, and Data Engineer positions were more demanded across different organizations. On the contrary, Data Architect and, certainly, ML Engineer vacancies could only be found in more specific organizations like tech consultancy firms or banks.

Finally, regarding the limitations of the present study, it is important to bear in mind that the data was collected solely from the OCC website and only for a very short period of time. Thus, very few observations were obtained for some of the least demanded data jobs categories: Data Scientist, Data Architect, and ML Engineer. Also, the collected data mostly corresponded to Ciudad de México, Nuevo León, Estado de México, and Jalisco, and no distinction was made among entry level, middle and senior positions. Thus, as future perspectives, it would be advisable to gather data from more job websites, retrieve information for a longer time span, make a distinction among entry level, middle and senior positions, and collect more salary data for Data Scientist, Data Architect, and ML Engineer positions as well as for other Mexican states.


8. References ¶


  • ai-jobs.net (2023). Machine Learning Engineer vs. Business Intelligence Data Analyst. https://ai-jobs.net/insights/machine-learning-engineer-vs-business-intelligence-data-analyst/
  • Arroyo-Velázquez, I. (2022). Visualizing Mexican Wildfires. States_MX [JSON file]. https://raw.githubusercontent.com/isaacarroyov/data_visualization_practice/master/Python/visualizing_mexican_wildfires_tds/data/states_mx.json.
  • Axis Talent. (2020). The Ecosystem of Data Jobs - Making sense of the Data Job Market. https://www.axistalent.io/blog/the-ecosystem-of-data-jobs-making-sense-of-the-data-job-market
  • Olvera, E. (2023). Reporte del Mercado Laboral de TI México 2023. Hireline. https://hireline.io/mx/estudio-mercado-laboral-y-empleos-de-ti-mexico
  • Rollins, J. B. (2015). Metodología Fundamental para la Ciencia de Datos. Somers: IBM Corporation. https://www.ibm.com/downloads/cas/WKK9DX51
In [ ]:
# Compressing all figures in a ZIP file
!zip -r figures.zip Figures
updating: Figures/ (260 bytes security) (stored 0%)
updating: Figures/Fig10_AverageSalaryPerDataJobCategory.png (172 bytes security) (deflated 17%)
updating: Figures/Fig10_AverageSalaryPerDataJobCategory2.png (172 bytes security) (deflated 26%)
updating: Figures/Fig10_AverageSalaryPerDataJobCategory2.svg (172 bytes security) (deflated 86%)
updating: Figures/Fig11_SalaryPerDataJobCategory.png (172 bytes security) (deflated 22%)
updating: Figures/Fig11_SalaryPerDataJobCategory2.png (172 bytes security) (deflated 16%)
updating: Figures/Fig11_SalaryPerDataJobCategory2.svg (172 bytes security) (deflated 92%)
updating: Figures/Fig12_MeanMedianSalaryPerDataJob.png (172 bytes security) (deflated 14%)
updating: Figures/Fig12_MeanMedianSalaryPerDataJob2.png (172 bytes security) (deflated 17%)
updating: Figures/Fig12_MeanMedianSalaryPerDataJob2.svg (172 bytes security) (deflated 87%)
updating: Figures/Fig13_SalaryPerLocationAndDataJobCategory.png (172 bytes security) (deflated 19%)
updating: Figures/Fig13_SalaryPerLocationAndDataJobCategory2.png (172 bytes security) (deflated 19%)
updating: Figures/Fig13_SalaryPerLocationAndDataJobCategory2.svg (172 bytes security) (deflated 83%)
updating: Figures/Fig14_Top20SalaryPerCompany.png (172 bytes security) (deflated 18%)
updating: Figures/Fig14_Top20SalaryPerCompany2.png (172 bytes security) (deflated 23%)
updating: Figures/Fig14_Top20SalaryPerCompany2.svg (172 bytes security) (deflated 90%)
updating: Figures/Fig15_SalaryPerCompanyAndDataJobCategory.png (172 bytes security) (deflated 17%)
updating: Figures/Fig15_SalaryPerCompanyAndDataJobCategory2.png (172 bytes security) (deflated 16%)
updating: Figures/Fig15_SalaryPerCompanyAndDataJobCategory2.svg (172 bytes security) (deflated 73%)
updating: Figures/Fig16_VacanciesHighestSalaries.png (172 bytes security) (deflated 13%)
updating: Figures/Fig17_DataJobsSalaryDistribution.png (172 bytes security) (deflated 27%)
updating: Figures/Fig17_DataJobsSalaryDistribution2.png (172 bytes security) (deflated 39%)
updating: Figures/Fig17_DataJobsSalaryDistribution2.svg (172 bytes security) (deflated 86%)
updating: Figures/Fig18_SalaryDistributionsPerDataJobCategory.png (172 bytes security) (deflated 31%)
updating: Figures/Fig18_SalaryDistributionsPerDataJobCategory2.png (172 bytes security) (deflated 28%)
updating: Figures/Fig18_SalaryDistributionsPerDataJobCategory2.svg (172 bytes security) (deflated 93%)
updating: Figures/Fig1_DemandOfDataJobsPerCategory.png (172 bytes security) (deflated 10%)
updating: Figures/Fig1_DemandOfDataJobsPerCategory2.png (172 bytes security) (deflated 13%)
updating: Figures/Fig1_DemandOfDataJobsPerCategory2.svg (172 bytes security) (deflated 82%)
updating: Figures/Fig2_DemandOfDataJobsPerMexicanState.png (172 bytes security) (deflated 11%)
updating: Figures/Fig2_DemandOfDataJobsPerMexicanState.svg (172 bytes security) (deflated 57%)
updating: Figures/Fig3_DemandPerDataJobCategoryInTopLocations1.png (172 bytes security) (deflated 17%)
updating: Figures/Fig3_DemandPerDataJobCategoryInTopLocations2.png (172 bytes security) (deflated 26%)
updating: Figures/Fig3_DemandPerDataJobCategoryInTopLocations2.svg (172 bytes security) (deflated 89%)
updating: Figures/Fig4_DemandPerLocationAndDataJobCategory1.png (172 bytes security) (deflated 20%)
updating: Figures/Fig4_DemandPerLocationAndDataJobCategory2.png (172 bytes security) (deflated 17%)
updating: Figures/Fig4_DemandPerLocationAndDataJobCategory2.svg (172 bytes security) (deflated 78%)
updating: Figures/Fig5_TopCompaniesDemandingDataJobs.png (172 bytes security) (deflated 25%)
updating: Figures/Fig5_TopCompaniesDemandingDataJobs.svg (172 bytes security) (deflated 87%)
updating: Figures/Fig6_DemandPerCompanyAndDataJobCategory.png (172 bytes security) (deflated 18%)
updating: Figures/Fig6_DemandPerCompanyAndDataJobCategory2.png (172 bytes security) (deflated 16%)
updating: Figures/Fig6_DemandPerCompanyAndDataJobCategory2.svg (172 bytes security) (deflated 79%)
updating: Figures/Fig7_DemandPerCompanyAndLocationTop30.png (172 bytes security) (deflated 18%)
updating: Figures/Fig7_DemandPerCompanyAndLocationTop30.svg (172 bytes security) (deflated 76%)
updating: Figures/Fig7_DemandPerCompanyAndLocationTop30_2.png (172 bytes security) (deflated 17%)
updating: Figures/Fig8_DemandPerCompanyAndLocation.png (172 bytes security) (deflated 24%)
updating: Figures/Fig8_DemandPerCompanyAndLocation2.png (172 bytes security) (deflated 23%)
updating: Figures/Fig8_DemandPerCompanyAndLocation2.svg (172 bytes security) (deflated 70%)
updating: Figures/Fig9_NumberOfSalaryObservationsPerDataJobCategory.png (172 bytes security) (deflated 19%)
updating: Figures/Fig9_NumberOfSalaryObservationsPerDataJobCategory2.png (172 bytes security) (deflated 26%)
updating: Figures/Fig9_NumberOfSalaryObservationsPerDataJobCategory2.svg (172 bytes security) (deflated 85%)
updating: Figures/Fig_BootstrapDistributions_BI Analyst-Data Analyst.png (172 bytes security) (deflated 20%)
updating: Figures/Fig_BootstrapDistributions_Business Analyst-BI Analys.png (172 bytes security) (deflated 19%)
updating: Figures/Fig_BootstrapDistributions_Business Analyst-Data Analyst.png (172 bytes security) (deflated 19%)
updating: Figures/Fig_BootstrapDistributions_Data Architect-BI Analyst.png (172 bytes security) (deflated 19%)
updating: Figures/Fig_BootstrapDistributions_Data Architect-Business Analyst.png (172 bytes security) (deflated 19%)
updating: Figures/Fig_BootstrapDistributions_Data Architect-DA Analyst.png (172 bytes security) (deflated 19%)
updating: Figures/Fig_BootstrapDistributions_Data Architect-Data Engineer.png (172 bytes security) (deflated 19%)
updating: Figures/Fig_BootstrapDistributions_Data Architect-Data Scientist.png (172 bytes security) (deflated 19%)
updating: Figures/Fig_BootstrapDistributions_Data Engineer-BI Analyst.png (172 bytes security) (deflated 19%)
updating: Figures/Fig_BootstrapDistributions_Data Engineer-Business Analysts.png (172 bytes security) (deflated 19%)
updating: Figures/Fig_BootstrapDistributions_Data Engineer-Data Analyst.png (172 bytes security) (deflated 20%)
updating: Figures/Fig_BootstrapDistributions_Data Engineer-Data Scientist.png (172 bytes security) (deflated 19%)
updating: Figures/Fig_BootstrapDistributions_Data Scientist-BI Analyst.png (172 bytes security) (deflated 19%)
updating: Figures/Fig_BootstrapDistributions_Data Scientist-Business Analyst.png (172 bytes security) (deflated 19%)
updating: Figures/Fig_BootstrapDistributions_Data Scientist-Data Analyst.png (172 bytes security) (deflated 19%)
updating: Figures/Header.png (172 bytes security) (deflated 1%)
  adding: Figures/Fig16_VacanciesHighestSalaries2.png (172 bytes security) (deflated 13%)
  adding: Figures/Fig16_VacanciesHighestSalaries2.svg (172 bytes security) (deflated 91%)
In [ ]:
# Code for composing the Python Requirements File
!pip freeze > requirements.txt
In [ ]:
# End